⚠️ This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun and the 🤖 in C# are 100% mine.
Hi!
So Google just dropped Gemma 4, their most capable open model family yet, and I couldn't resist. I spent a good chunk of time digging into the architecture, trying to convert models, hitting walls, finding workarounds, and hitting more walls. Here's where things stand with ElBruno.LocalLLMs.
Spoiler: the library is ready for Gemma 4. The ONNX runtime… not yet. So, let me tell you the whole story.
Wait, What’s Gemma 4?
Google released four new models on April 2, 2026, and they’re pretty wild:
| Model | Parameters | What’s Cool | Context |
|---|---|---|---|
| E2B IT | 5.1B (only 2.3B active!) | Tiny but punches above its weight | 128K |
| E4B IT | 8B (4.5B active) | Sweet spot for most use cases | 128K |
| 26B A4B IT | 25.2B (3.8B active) | MoE: only fires 3.8B params per token 🤯 | 256K |
| 31B IT | 30.7B | The big one, dense, no tricks | 256K |
The magic sauce is something called Per-Layer Embeddings (PLE): basically, each transformer layer gets its own little embedding input. That's how a 5.1B model acts like a 2.3B one. Clever stuff.
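To make that concrete, here is a toy Python sketch of the PLE idea. This is my own illustration, not Google's implementation: every transformer layer owns its own tiny embedding table, and each token does one lookup per layer instead of a single shared lookup at the bottom of the stack.

```python
import random

# Toy dimensions; real Gemma 4 is obviously much bigger.
VOCAB, DIM, LAYERS = 100, 4, 3
random.seed(0)

# One small embedding table PER LAYER instead of a single shared one.
per_layer_tables = [
    [[random.random() for _ in range(DIM)] for _ in range(VOCAB)]
    for _ in range(LAYERS)
]

def per_layer_inputs(token_id: int) -> list[list[float]]:
    """Return one embedding vector per layer for a single token."""
    return [table[token_id] for table in per_layer_tables]

embeddings = per_layer_inputs(42)
print(len(embeddings))     # 3: one vector per layer
print(len(embeddings[0]))  # 4: each vector is DIM wide
```

Only the per-layer lookups for the current token are "active", which is the trick that lets total parameter count and effective parameter count diverge.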
They’re all Apache 2.0. No gating, no license hoops. I like that.
What I Got Working (v0.8.0)
✅ Model Definitions: Done
All four Gemma 4 variants are registered and ready to go:
```csharp
var options = new LocalLLMsOptions
{
    Model = KnownModels.Gemma4E2BIT // Smallest, edge-optimized
};
```
I added Gemma4E2BIT, Gemma4E4BIT, Gemma4_26BA4BIT, and Gemma4_31BIT. The moment ONNX models exist, you just point and shoot.
✅ Chat Template: Already Works
Here’s the fun part: Gemma 4 uses the exact same chat template as Gemma 2 and 3:
```
<start_of_turn>user
What is the capital of France?<end_of_turn>
<start_of_turn>model
```
My existing GemmaFormatter handles it perfectly. Zero code changes needed. System messages fold into the first user turn, tool calling works; the whole thing just… works. I love when that happens.
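The real formatter lives in C#, but here is a hypothetical Python sketch of the same template logic, including the "fold the system message into the first user turn" trick (the Gemma template has no dedicated system role):

```python
START, END = "<start_of_turn>", "<end_of_turn>"

def format_gemma(messages: list[dict]) -> str:
    """Render OpenAI-style role/content messages into the Gemma chat template."""
    system = ""
    parts = []
    for msg in messages:
        if msg["role"] == "system":
            # No system role in the template: stash it for the first user turn.
            system = msg["content"]
            continue
        role = "model" if msg["role"] == "assistant" else "user"
        content = msg["content"]
        if role == "user" and system:
            content = f"{system}\n\n{content}"
            system = ""
        parts.append(f"{START}{role}\n{content}{END}\n")
    parts.append(f"{START}model\n")  # generation prompt
    return "".join(parts)

prompt = format_gemma([
    {"role": "system", "content": "You are terse."},
    {"role": "user", "content": "What is the capital of France?"},
])
print(prompt)
```

Because Gemma 2, 3, and 4 all share this template, the same formatter covers all three families.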
✅ Tool Calling: Yep, That Too
Gemma 4 natively supports function calling, and my formatter already handles the Gemma tool-calling format with proper JSON function definitions. No changes needed.
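The library handles the exact Gemma tool-calling wire format; as a generic illustration (names and shapes here are my own, not the library's API), here is what "JSON function definitions" plus parsing a model's tool-call response looks like:

```python
import json

# A function described as a JSON schema the model can see in the prompt.
get_weather = {
    "name": "get_weather",
    "description": "Get current weather for a city",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def parse_tool_call(model_output: str) -> dict:
    """Parse a JSON tool call like {"name": ..., "arguments": {...}}."""
    call = json.loads(model_output)
    if "name" not in call or "arguments" not in call:
        raise ValueError("not a tool call")
    return call

call = parse_tool_call('{"name": "get_weather", "arguments": {"city": "Madrid"}}')
print(call["name"])  # get_weather
```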
✅ Tests: A Lot of Them
I went a bit overboard here (no regrets, and thanks Copilot!):
- 6 model definition tests: making sure all four variants are correctly registered
- 9 tool-calling tests: validating function calling scenarios with Gemma 4
- 195 multilingual tests: this one deserves its own section (see below)
All 697 tests pass. ✅
✅ Conversion Scripts: Ready and Waiting
I wrote dedicated Python and PowerShell conversion scripts:
```shell
python scripts/convert_gemma4.py --model-size e2b --output-dir ./models/gemma4-e2b
```
They’re ready. They just need a runtime that can handle Gemma 4. Which brings me to…
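Under the hood, a script like this mostly assembles a call to the onnxruntime-genai model builder. Here's a hedged Python sketch of that shape (the `-m`/`-o`/`-p`/`-e` flags are the builder's documented options; the `convert` helper is mine, and for Gemma 4 checkpoints this would fail today, which is the whole point of the next section):

```python
import subprocess
import sys

def convert(model_id: str, output_dir: str, precision: str = "int4") -> list[str]:
    """Build the onnxruntime-genai model builder command for a checkpoint."""
    return [
        sys.executable, "-m", "onnxruntime_genai.models.builder",
        "-m", model_id,    # HF model id or local path
        "-o", output_dir,  # where the ONNX model lands
        "-p", precision,   # int4 / fp16 / fp32
        "-e", "cpu",       # execution provider
    ]

cmd = convert("google/gemma-2-2b-it", "./models/gemma2-2b")
print(cmd)
# To actually run it: subprocess.run(cmd, check=True)
```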
⏳ The Honest Part: ONNX Conversion Is Blocked 🚧
OK, here’s where I hit a wall. The ONNX conversion doesn’t work yet.
(I may be missing something here, but hey, it’s a long weekend!)
What’s the Problem?
Gemma 4 has three architectural features that onnxruntime-genai v0.12.2 simply doesn’t support:
- Per-Layer Embeddings (PLE): each layer needs a separate `per_layer_inputs` tensor. The runtime expects one embedding output, not three dozen.
- Variable Head Dimensions: sliding-attention layers use `head_dim=256`, full-attention layers (every 5th one) use `512`. The runtime config only has ONE `head_size` field. Pick one? Yeah, no.
- KV Cache Sharing: 35 layers share only 15 unique KV cache pairs. The runtime expects a 1:1 mapping. The math doesn’t math.
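The head-dimension blocker is easy to see in a few lines of Python. This sketch assumes "every 5th layer" means layers 5, 10, 15, … counting from 1 (my assumption); the point is that a single `head_size` config field can never describe the result:

```python
NUM_LAYERS = 35

def head_dim(layer: int) -> int:
    """Per-layer head dimension: full attention every 5th layer (1-based)."""
    return 512 if (layer + 1) % 5 == 0 else 256

dims = sorted({head_dim(i) for i in range(NUM_LAYERS)})
print(dims)  # [256, 512]: two distinct head sizes, one config field. Ouch.
```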
What I Tried (The Fun Part)
Here’s my adventure:
- 🔧 Patched the GenAI builder to route Gemma 4 through the Gemma 3 pipeline; it actually produced a 1.6GB ONNX file! But then the runtime choked with a shape mismatch at the full attention layers. So close.
- 🔍 Examined the onnx-community models; they have the right structure, but the I/O format is incompatible with GenAI’s KV cache management.
- 🧪 Tried loading as `Gemma4ForCausalLM`; nope, weights are stored under a multimodal prefix. Mismatch everywhere.
- 🔎 Searched for pre-release builds; nothing. 0.12.2 is the latest.
- 📋 Checked GitHub issues/PRs; zero Gemma 4 mentions in the repo.
So When Will It Work?
The moment onnxruntime-genai adds Gemma 4 support, I’m ready to go:
- Model definitions ✅
- Chat template ✅
- Tests ✅
- Conversion scripts ✅
- Documentation ✅
I’m watching: microsoft/onnxruntime-genai releases
Bonus: I Went Multilingual
While I was in testing mode, I figured: why not make sure all my formatters handle every language properly? So I added 195 multilingual tests covering:
| Script/Language | Examples |
|---|---|
| CJK | 日本語, 中文, 한국어 |
| Cyrillic | Русский |
| Arabic | العربية (RTL) |
| Hebrew | עברית (RTL) |
| Devanagari | हिन्दी |
| Tamil | தமிழ் |
| Thai | ไทย |
| European | Accented Latin characters (Ž and others) |
| Emoji | 🤖 and other emoji |
| Zero-width | ZWJ, ZWNJ characters |
All 7 formatters (ChatML, Phi3, Llama3, Qwen, Mistral, Gemma, DeepSeek) handle Unicode correctly. If you’re running models locally, you probably care about this. I know I do.
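For flavor, here is a minimal Python sketch of the kind of check those tests run: format a message containing non-ASCII text and verify it passes through the template byte-for-byte. The `wrap` helper is illustrative, not the real test suite's API.

```python
SAMPLES = [
    "日本語",       # CJK
    "Русский",     # Cyrillic
    "العربية",     # Arabic (RTL)
    "हिन्दी",        # Devanagari
    "🤖",          # emoji
    "a\u200db",    # zero-width joiner between two letters
]

def wrap(content: str) -> str:
    """Gemma-style template around a single user message."""
    return f"<start_of_turn>user\n{content}<end_of_turn>\n<start_of_turn>model\n"

for text in SAMPLES:
    assert text in wrap(text)  # content must survive untouched
print("all samples round-trip")
```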
Try It Out
Grab v0.8.0:
```shell
dotnet add package ElBruno.LocalLLMs --version 0.8.0
```
Gemma 4 ONNX models aren’t ready yet, but there are 25+ other models that work right now:
```csharp
using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Gemma 2 works great today
var options = new LocalLLMsOptions { Model = KnownModels.Gemma2_2BIT };
using var client = await LocalChatClient.CreateAsync(options);

var response = await client.GetResponseAsync([
    new(ChatRole.User, "Tell me about Gemma 4!")
]);

Console.WriteLine(response.Text);
```
Links
- 📦 NuGet Package
- 📋 Supported Models
- 🔧 ONNX Conversion Guide
- 🚫 Blocked Models Reference
- 🔗 Google Gemma 4 Announcement
- 🐙 GitHub Repository
Happy coding!
Greetings
El Bruno
More posts in my blog ElBruno.com.
More info at https://beacons.ai/elbruno
