⚠️ This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun and the 🤖 in C# are 100% mine.

Hi!

So Google just dropped Gemma 4 — their most capable open model family yet — and I couldn’t resist. I spent a good chunk of time digging into the architecture, trying to convert models, hitting walls, finding workarounds, and hitting more walls. Here’s where things stand with ElBruno.LocalLLMs.

Spoiler: the library is ready for Gemma 4. The ONNX runtime… not yet. So, let me tell you the whole story.


Wait, What’s Gemma 4?

Google released four new models on April 2, 2026, and they’re pretty wild:

| Model | Parameters | What’s Cool | Context |
|---|---|---|---|
| E2B IT | 5.1B (only 2.3B active!) | Tiny but punches above its weight | 128K |
| E4B IT | 8B (4.5B active) | Sweet spot for most use cases | 128K |
| 26B A4B IT | 25.2B (3.8B active) | MoE — only fires 3.8B params per token 🤯 | 256K |
| 31B IT | 30.7B | The big one, dense, no tricks | 256K |

The magic sauce is something called Per-Layer Embeddings (PLE) — basically, each transformer layer gets its own little embedding input. That’s how a 5.1B model acts like a 2.3B one. Clever stuff.
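If code helps the intuition, here’s a tiny sketch of my mental model of PLE (not Google’s actual implementation; every name and dimension below is invented):

// Toy sketch of Per-Layer Embeddings: besides the shared token
// embedding, every transformer layer does its own small per-token
// lookup. Illustrative only.
float[] ForwardToken(int tokenId, float[][] sharedEmbedding, float[][][] perLayerEmbeddings)
{
    var hidden = sharedEmbedding[tokenId];
    for (int layer = 0; layer < perLayerEmbeddings.Length; layer++)
    {
        // Each layer consumes its own embedding slice, so only a small
        // fraction of the total parameters is active for a given token.
        float[] pleInput = perLayerEmbeddings[layer][tokenId];
        hidden = ApplyTransformerLayer(layer, hidden, pleInput);
    }
    return hidden;
}

// Stand-in for the real layer math.
static float[] ApplyTransformerLayer(int layer, float[] hidden, float[] pleInput) => hidden;

It’s also the first hint of why the ONNX export is painful: the exported graph needs one extra input per layer instead of a single embedding output (more on that below).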

They’re all Apache 2.0. No gating, no license hoops. I like that.


What I Got Working (v0.8.0)

✅ Model Definitions — Done

All four Gemma 4 variants are registered and ready to go:

var options = new LocalLLMsOptions
{
    Model = KnownModels.Gemma4E2BIT  // Smallest, edge-optimized
};

I added Gemma4E2BIT, Gemma4E4BIT, Gemma4_26BA4BIT, and Gemma4_31BIT. The moment ONNX models exist, you just point and shoot.
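And the point-and-shoot part really is a one-liner, the same pattern used for every other model in the library (speculative for Gemma 4 until the runtime catches up):

// Continuing from the options above. This will light up the moment a
// compatible ONNX build of Gemma 4 is available.
using var client = await LocalChatClient.CreateAsync(options);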

✅ Chat Template — Already Works

Here’s the fun part: Gemma 4 uses the exact same chat template as Gemma 2 and 3:

<start_of_turn>user
What is the capital of France?<end_of_turn>
<start_of_turn>model

My existing GemmaFormatter handles it perfectly. Zero code changes needed. System messages fold into the first user turn, tool calling works — the whole thing just… works. I love when that happens.
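If you’re curious what the formatter actually produces, here’s a stripped-down sketch of the rendering loop (hypothetical code, not the real GemmaFormatter source; it skips the system-message folding and tool-call handling):

using System.Text;
using Microsoft.Extensions.AI;

// Minimal Gemma-style prompt rendering. Gemma only knows "user" and
// "model" turn roles, and generation continues from an open model turn.
static string FormatGemmaPrompt(IEnumerable<ChatMessage> messages)
{
    var sb = new StringBuilder();
    foreach (var message in messages)
    {
        var role = message.Role == ChatRole.Assistant ? "model" : "user";
        sb.Append("<start_of_turn>").Append(role).Append('\n');
        sb.Append(message.Text).Append("<end_of_turn>\n");
    }
    sb.Append("<start_of_turn>model\n");
    return sb.ToString();
}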

✅ Tool Calling — Yep, That Too

Gemma 4 natively supports function calling, and my formatter already handles the Gemma tool-calling format with proper JSON function definitions. No changes needed.
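If you haven’t wired tools through Microsoft.Extensions.AI before, it looks roughly like this. GetCurrentTime is a made-up sample function, and whether LocalChatClient consumes ChatOptions.Tools exactly this way is my assumption based on the standard IChatClient abstractions:

using System.ComponentModel;
using Microsoft.Extensions.AI;

// Made-up sample tool; only the wiring pattern matters here.
[Description("Gets the current UTC time.")]
static string GetCurrentTime() => DateTime.UtcNow.ToString("O");

// 'client' is the LocalChatClient created as shown earlier in the post.
var chatOptions = new ChatOptions
{
    Tools = [AIFunctionFactory.Create(GetCurrentTime)]
};

var response = await client.GetResponseAsync(
    [new(ChatRole.User, "What time is it?")],
    chatOptions);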

✅ Tests — A Lot of Them

I went a bit overboard here (no regrets, and thanks Copilot!):

  • 6 model definition tests — making sure all four variants are correctly registered
  • 9 tool-calling tests — validating function calling scenarios with Gemma 4
  • 195 multilingual tests — this one deserves its own section (see below)

All 697 tests pass. ✅
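For flavor, a model-definition test in this suite is essentially this shape (a hypothetical sketch; everything except KnownModels.Gemma4E2BIT is illustrative):

using Xunit;

public class Gemma4ModelDefinitionTests
{
    // Sketch of a registration check; the real assertions live in the repo.
    [Fact]
    public void Gemma4E2BIT_IsRegistered()
    {
        Assert.NotNull(KnownModels.Gemma4E2BIT);
    }
}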

✅ Conversion Scripts — Ready and Waiting

I wrote dedicated Python and PowerShell conversion scripts:

python scripts/convert_gemma4.py --model-size e2b --output-dir ./models/gemma4-e2b

They’re ready. They just need a runtime that can handle Gemma 4. Which brings me to…


⏳ The Honest Part: ONNX Conversion Is Blocked 😔

OK, here’s where I hit a wall. The ONNX conversion doesn’t work yet.
(I may be missing something here, but hey, it’s a long weekend!)

What’s the Problem?

Gemma 4 has three architectural features that onnxruntime-genai v0.12.2 simply doesn’t support:

  1. Per-Layer Embeddings (PLE) — each layer needs a separate per_layer_inputs tensor. The runtime expects one embedding output. Not three dozen.
  2. Variable Head Dimensions — sliding attention layers use head_dim=256, full attention layers (every 5th one) use 512. The runtime config only has ONE head_size field. Pick one? Yeah, no.
  3. KV Cache Sharing — 35 layers share only 15 unique KV cache pairs. The runtime expects a 1:1 mapping. Math doesn’t math (see the toy sketch below).
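To make points 2 and 3 concrete, here’s a toy sketch of the per-layer layout a runtime would need to express. The every-5th-layer pattern and the 35-to-15 numbers come from above; the mapping logic itself is my illustration, not the real model graph:

// Toy illustration only. A single global head_size field can't capture
// the per-layer variation, and a 1:1 layer-to-KV-cache assumption can't
// capture the sharing.
static int HeadDimForLayer(int layer) =>
    (layer + 1) % 5 == 0 ? 512 : 256;   // full attention every 5th layer

// Hypothetical 35-layers-onto-15-caches mapping (the real scheme differs).
static int KvCacheForLayer(int layer) =>
    layer < 15 ? layer : layer % 15;

for (int layer = 0; layer < 35; layer++)
    Console.WriteLine($"layer {layer,2}: head_dim={HeadDimForLayer(layer)}, kv_cache={KvCacheForLayer(layer)}");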

What I Tried (The Fun Part)

Here’s my adventure:

  • 🔧 Patched the GenAI builder to route Gemma 4 through the Gemma 3 pipeline — it actually produced a 1.6GB ONNX file! But then the runtime choked with a shape mismatch at the full attention layers. So close.
  • 🔍 Examined the onnx-community models — they have the right structure, but the I/O format is incompatible with GenAI’s KV cache management.
  • 🧪 Tried loading as Gemma4ForCausalLM — nope, weights are stored under a multimodal prefix. Mismatch everywhere.
  • 🔎 Searched for pre-release builds — nothing. 0.12.2 is the latest.
  • 📋 Checked GitHub issues/PRs — zero Gemma 4 mentions in the repo.

So When Will It Work?

The moment onnxruntime-genai adds Gemma 4 support, I’m ready to go:

  • Model definitions ✅
  • Chat template ✅
  • Tests ✅
  • Conversion scripts ✅
  • Documentation ✅

I’m watching: microsoft/onnxruntime-genai releases


Bonus: I Went Multilingual

While I was in testing mode, I figured — why not make sure all my formatters handle every language properly? So I added 195 multilingual tests covering:

| Script/Language | Examples |
|---|---|
| CJK | 日本語, 中文, 한국어 |
| Cyrillic | Русский |
| Arabic | العربية (RTL) |
| Hebrew | עברית (RTL) |
| Devanagari | हिन्दी |
| Tamil | தமிழ் |
| Thai | ไทย |
| European | Ñ, Ü, Ø, Ž, Ą |
| Emoji | 🤖, 👋, 🌍 |
| Zero-width | ZWJ, ZWNJ characters |

All 7 formatters (ChatML, Phi3, Llama3, Qwen, Mistral, Gemma, DeepSeek) handle Unicode correctly. If you’re running models locally, you probably care about this. I know I do.


Try It Out

Grab v0.8.0:

dotnet add package ElBruno.LocalLLMs --version 0.8.0

Gemma 4 ONNX models aren’t ready yet, but there are 25+ other models that work right now:

using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Gemma 2 works great today
var options = new LocalLLMsOptions { Model = KnownModels.Gemma2_2BIT };
using var client = await LocalChatClient.CreateAsync(options);
var response = await client.GetResponseAsync([
    new(ChatRole.User, "Tell me about Gemma 4!")
]);
Console.WriteLine(response.Text);
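If you want tokens as they are generated, the Microsoft.Extensions.AI abstraction also exposes streaming (assuming LocalChatClient implements the streaming side of IChatClient, which I’d expect from a local runtime):

// Streaming variant of the call above; prints tokens as they arrive.
await foreach (var update in client.GetStreamingResponseAsync([
    new(ChatRole.User, "Tell me about Gemma 4!")
]))
{
    Console.Write(update.Text);
}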


Happy coding!

Greetings

El Bruno

More posts in my blog ElBruno.com.

More info at https://beacons.ai/elbruno

