⚠️ This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun and the 🤖 in C# are 100% mine.
Hi!
So Google just dropped Gemma 4 — their most capable open model family yet — and I couldn’t resist. I spent a good chunk of time digging into the architecture, trying to convert models, hitting walls, finding workarounds, and hitting more walls. Here’s where things stand with ElBruno.LocalLLMs.
Spoiler: the library is ready for Gemma 4. The ONNX runtime… not yet. So, let me tell you the whole story.
Wait, What’s Gemma 4?
Google released four new models on April 2, 2026, and they’re pretty wild:
| Model | Parameters | What’s Cool | Context |
|---|---|---|---|
| E2B IT | 5.1B (only 2.3B active!) | Tiny but punches above its weight | 128K |
| E4B IT | 8B (4.5B active) | Sweet spot for most use cases | 128K |
| 26B A4B IT | 25.2B (3.8B active) | MoE — only fires 3.8B params per token 🤯 | 256K |
| 31B IT | 30.7B | The big one, dense, no tricks | 256K |
The magic sauce is something called Per-Layer Embeddings (PLE) — basically, each transformer layer gets its own little embedding input. That’s how a 5.1B model acts like a 2.3B one. Clever stuff.
They’re all Apache 2.0. No gating, no license hoops. I like that.
What I Got Working (v0.8.0)
✅ Model Definitions — Done
All four Gemma 4 variants are registered and ready to go:
var options = new LocalLLMsOptions
{
    Model = KnownModels.Gemma4E2BIT // Smallest, edge-optimized
};
I added Gemma4E2BIT, Gemma4E4BIT, Gemma4_26BA4BIT, and Gemma4_31BIT. The moment ONNX models exist, you just point and shoot.
✅ Chat Template — Already Works
Here’s the fun part: Gemma 4 uses the exact same chat template as Gemma 2 and 3:
<start_of_turn>user
What is the capital of France?<end_of_turn>
<start_of_turn>model
My existing GemmaFormatter handles it perfectly. Zero code changes needed. System messages fold into the first user turn, tool calling works — the whole thing just… works. I love when that happens.
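For readers who want to see the shape of that template in code, here is a minimal sketch. It is not the library's GemmaFormatter: `FormatGemmaPrompt` is a hypothetical helper, the newline placement follows the published Gemma 2/3 template, and joining the system text to the first user turn with a blank line is an assumption about how the folding is done.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not the library's GemmaFormatter.
static string FormatGemmaPrompt(IReadOnlyList<(string Role, string Text)> messages)
{
    var sb = new StringBuilder();
    var pendingSystem = "";

    foreach (var (role, text) in messages)
    {
        if (role == "system")
        {
            // Gemma has no system role, so hold the text and fold it into the next user turn.
            pendingSystem = pendingSystem.Length == 0 ? text : pendingSystem + "\n\n" + text;
            continue;
        }

        // Gemma calls the assistant role "model".
        var gemmaRole = role == "assistant" ? "model" : "user";
        var body = text;
        if (gemmaRole == "user" && pendingSystem.Length > 0)
        {
            // Assumption: system text and user text are separated by a blank line.
            body = pendingSystem + "\n\n" + text;
            pendingSystem = "";
        }

        sb.Append("<start_of_turn>").Append(gemmaRole).Append('\n')
          .Append(body).Append("<end_of_turn>\n");
    }

    // Leave a model turn open so generation continues as the assistant.
    sb.Append("<start_of_turn>model\n");
    return sb.ToString();
}

var prompt = FormatGemmaPrompt(new[]
{
    ("system", "Answer briefly."),
    ("user", "What is the capital of France?")
});
Console.Write(prompt);
```

The sketch leaves out the `<bos>` token, which the tokenizer normally adds, and tool-calling definitions.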
✅ Tool Calling — Yep, That Too
Gemma 4 natively supports function calling, and my formatter already handles the Gemma tool-calling format with proper JSON function definitions. No changes needed.
✅ Tests — A Lot of Them
I went a bit overboard here (no regrets, and thanks Copilot!):
- 6 model definition tests — making sure all four variants are correctly registered
- 9 tool-calling tests — validating function calling scenarios with Gemma 4
- 195 multilingual tests — this one deserves its own section (see below)
All 697 tests pass. ✅
✅ Conversion Scripts — Ready and Waiting
I wrote dedicated Python and PowerShell conversion scripts:
python scripts/convert_gemma4.py --model-size e2b --output-dir ./models/gemma4-e2b
They’re ready. They just need a runtime that can handle Gemma 4. Which brings me to…
⏳ The Honest Part: ONNX Conversion Is Blocked 😔
OK, here’s where I hit a wall. The ONNX conversion doesn’t work yet.
(I may be missing something here, but hey, it's a long weekend!)
What’s the Problem?
Gemma 4 has three architectural features that onnxruntime-genai v0.12.2 simply doesn’t support:
- Per-Layer Embeddings (PLE): each layer needs a separate per_layer_inputs tensor. The runtime expects one embedding output. Not three dozen.
- Variable Head Dimensions: sliding attention layers use head_dim=256, full attention layers (every 5th one) use 512. The runtime config only has ONE head_size field. Pick one? Yeah, no.
- KV Cache Sharing: 35 layers share only 15 unique KV cache pairs. The runtime expects a 1:1 mapping. Math doesn't math.
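The mismatch is easy to see with a little arithmetic. The sketch below uses only the numbers quoted above (35 layers, 15 KV pairs, head dims 256 and 512). Two things are assumptions: that "every 5th" means layers 5, 10, 15 and so on, and that this layer count belongs to a single variant. It does not try to guess which layers share which cache.

```csharp
using System;
using System.Linq;

const int LayerCount = 35;        // from the post
const int UniqueKvPairs = 15;     // from the post
const int SlidingHeadDim = 256;   // from the post
const int FullHeadDim = 512;      // from the post
const int FullAttentionEvery = 5; // "every 5th one"; exact placement is an assumption

// Per-layer head dimension, counting layers from 1.
int[] headDims = Enumerable.Range(1, LayerCount)
    .Select(layer => layer % FullAttentionEvery == 0 ? FullHeadDim : SlidingHeadDim)
    .ToArray();

int fullLayers = headDims.Count(d => d == FullHeadDim);
int distinctHeadDims = headDims.Distinct().Count();

Console.WriteLine($"Full attention layers: {fullLayers} of {LayerCount}");
Console.WriteLine($"Distinct head dims: {distinctHeadDims} (a single head_size field holds 1)");
Console.WriteLine($"KV cache pairs: {UniqueKvPairs}, but a 1:1 mapping would need {LayerCount}");
```

Under those assumptions, any single head_size value is wrong for at least 7 of the 35 layers, which fits the shape mismatch showing up at the full attention layers.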
What I Tried (The Fun Part)
Here’s my adventure:
- 🔧 Patched the GenAI builder to route Gemma 4 through the Gemma 3 pipeline — it actually produced a 1.6GB ONNX file! But then the runtime choked with a shape mismatch at the full attention layers. So close.
- 🔍 Examined the onnx-community models — they have the right structure, but the I/O format is incompatible with GenAI’s KV cache management.
- 🧪 Tried loading as Gemma4ForCausalLM: nope, weights are stored under a multimodal prefix. Mismatch everywhere.
- 🔎 Searched for pre-release builds: nothing. 0.12.2 is the latest.
- 📋 Checked GitHub issues/PRs — zero Gemma 4 mentions in the repo.
So When Will It Work?
The moment onnxruntime-genai adds Gemma 4 support, I’m ready to go:
- Model definitions ✅
- Chat template ✅
- Tests ✅
- Conversion scripts ✅
- Documentation ✅
I’m watching: microsoft/onnxruntime-genai releases
Bonus: I Went Multilingual
While I was in testing mode, I figured — why not make sure all my formatters handle every language properly? So I added 195 multilingual tests covering:
| Script/Language | Examples |
|---|---|
| CJK | 日本語, 中文, 한국어 |
| Cyrillic | Русский |
| Arabic | العربية (RTL) |
| Hebrew | עברית (RTL) |
| Devanagari | हिन्दी |
| Tamil | தமிழ் |
| Thai | ไทย |
| European | Ñ, Ü, Ø, Ž, Ą |
| Emoji | 🤖, 👋, 🌍 |
| Zero-width | ZWJ, ZWNJ characters |
All 7 formatters (ChatML, Phi3, Llama3, Qwen, Mistral, Gemma, DeepSeek) handle Unicode correctly. If you’re running models locally, you probably care about this. I know I do.
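As a rough idea of what one of these checks can look like, here is a minimal sketch. `WrapUserTurn` is a hypothetical stand-in for a formatter, not an API from ElBruno.LocalLLMs, and the sample strings are illustrative. The point is the comparison: it has to be ordinal, because culture-sensitive comparisons can ignore zero-width characters and would hide a dropped ZWJ or ZWNJ.

```csharp
using System;

// Hypothetical stand-in for a formatter; the real ones live in the library.
static string WrapUserTurn(string text) =>
    "<start_of_turn>user\n" + text + "<end_of_turn>\n";

string[] samples =
{
    "日本語", "Русский", "العربية", "עברית", "हिन्दी", "ไทย",
    "Ñ Ü Ø Ž Ą", "🤖👋🌍",
    "👨\u200D💻", // emoji sequence joined by a zero-width joiner (ZWJ)
    "a\u200Cb"    // zero-width non-joiner (ZWNJ) between two letters
};

foreach (var sample in samples)
{
    var prompt = WrapUserTurn(sample);

    // Ordinal comparison checks every UTF-16 code unit, including zero-width ones.
    if (!prompt.Contains(sample, StringComparison.Ordinal))
    {
        throw new Exception($"Text changed during formatting: {sample}");
    }
}

Console.WriteLine($"{samples.Length} samples survived formatting unchanged.");
```

A real test suite would run the same samples through each formatter and also cover system and tool messages.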
Try It Out
Grab v0.8.0:
dotnet add package ElBruno.LocalLLMs --version 0.8.0
Gemma 4 ONNX models aren’t ready yet, but there are 25+ other models that work right now:
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;
// Gemma 2 works great today
var options = new LocalLLMsOptions { Model = KnownModels.Gemma2_2BIT };
using var client = await LocalChatClient.CreateAsync(options);
var response = await client.GetResponseAsync([
    new(ChatRole.User, "Tell me about Gemma 4!")
]);
Console.WriteLine(response.Text);
Links
- 📦 NuGet Package
- 📖 Supported Models
- 🔧 ONNX Conversion Guide
- 🚫 Blocked Models Reference
- 🌐 Google Gemma 4 Announcement
- 🐙 GitHub Repository
Happy coding!
Greetings
El Bruno
More posts on my blog, ElBruno.com.
More info at https://beacons.ai/elbruno
