โš ๏ธ This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun and the ๐Ÿค– in C# are 100% mine.

Hi!

So Google just dropped Gemma 4, their most capable open model family yet, and I couldn’t resist. I spent a good chunk of time digging into the architecture, trying to convert models, hitting walls, finding workarounds, and hitting more walls. Here’s where things stand with ElBruno.LocalLLMs.

Spoiler: the library is ready for Gemma 4. The ONNX runtime… not yet. So, let me tell you the whole story.


Wait, What’s Gemma 4?

Google released four new models on April 2, 2026, and they’re pretty wild:

| Model | Parameters | What’s Cool | Context |
|---|---|---|---|
| E2B IT | 5.1B (only 2.3B active!) | Tiny but punches above its weight | 128K |
| E4B IT | 8B (4.5B active) | Sweet spot for most use cases | 128K |
| 26B A4B IT | 25.2B (3.8B active) | MoE: only fires 3.8B params per token 🤯 | 256K |
| 31B IT | 30.7B | The big one, dense, no tricks | 256K |

The magic sauce is something called Per-Layer Embeddings (PLE): basically, each transformer layer gets its own little embedding input. That’s how a 5.1B model acts like a 2.3B one. Clever stuff.
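To make that concrete, here’s a toy sketch of the idea in Python: instead of one embedding lookup feeding the whole stack, every layer folds in its own small embedding on the way through. This is my mental model of PLE, not Google’s implementation (the real thing is tensors, not scalars):

```python
# Toy sketch (my mental model, not Google's code): with per-layer
# embeddings, each transformer layer receives its own embedding lookup
# in addition to the shared hidden state flowing through the stack.

def forward_with_ple(token_id, layers, shared_embedding, per_layer_embeddings):
    """Run a toy 'transformer' where every layer adds its own embedding."""
    hidden = shared_embedding[token_id]
    for layer_idx, layer in enumerate(layers):
        # Each layer folds in its dedicated per-layer embedding input.
        ple = per_layer_embeddings[layer_idx][token_id]
        hidden = layer(hidden + ple)
    return hidden

# Minimal fake model: scalars instead of tensors, identity-ish layers.
layers = [lambda h: h * 1.0 for _ in range(3)]
shared = {7: 1.0}
ple = [{7: 0.1}, {7: 0.2}, {7: 0.3}]
print(round(forward_with_ple(7, layers, shared, ple), 6))  # 1.6
```

The upshot: the big shared embedding table can live off the hot path, while each layer only ever touches its own small slice, which is how the “active” parameter count stays so low.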

They’re all Apache 2.0. No gating, no license hoops. I like that.


What I Got Working (v0.8.0)

✅ Model Definitions: Done

All four Gemma 4 variants are registered and ready to go:

var options = new LocalLLMsOptions
{
    Model = KnownModels.Gemma4E2BIT  // Smallest, edge-optimized
};

I added Gemma4E2BIT, Gemma4E4BIT, Gemma4_26BA4BIT, and Gemma4_31BIT. The moment ONNX models exist, you just point and shoot.

✅ Chat Template: Already Works

Here’s the fun part: Gemma 4 uses the exact same chat template as Gemma 2 and 3:

<start_of_turn>user
What is the capital of France?<end_of_turn>
<start_of_turn>model

My existing GemmaFormatter handles it perfectly. Zero code changes needed. System messages fold into the first user turn, tool calling works; the whole thing just… works. I love when that happens.
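If you want to see that folding behavior concretely, here’s a minimal Python sketch of the template logic (a toy re-implementation for illustration, not the actual GemmaFormatter code):

```python
def format_gemma_prompt(messages):
    """Render (role, text) messages with the Gemma <start_of_turn> template.

    Gemma has no dedicated system role, so a leading system message is
    folded into the first user turn, mirroring what the post describes.
    """
    system = None
    turns = []
    for role, text in messages:
        if role == "system" and not turns:
            system = text
            continue
        if role == "user" and system is not None:
            text = f"{system}\n\n{text}"
            system = None
        # Gemma uses the role name "model" for assistant turns.
        name = "model" if role == "assistant" else role
        turns.append(f"<start_of_turn>{name}\n{text}<end_of_turn>\n")
    # End with an open model turn so generation continues from there.
    return "".join(turns) + "<start_of_turn>model\n"

prompt = format_gemma_prompt([
    ("system", "You answer briefly."),
    ("user", "What is the capital of France?"),
])
print(prompt)
```

Note there is no `<start_of_turn>system` anywhere in the output; the system text simply rides along inside the first user turn.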

✅ Tool Calling: Yep, That Too

Gemma 4 natively supports function calling, and my formatter already handles the Gemma tool-calling format with proper JSON function definitions. No changes needed.
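To make “proper JSON function definitions” a bit more concrete, here’s the general shape of a function definition and a parsed tool call, in Python for brevity. This is the common JSON-schema style; it is not necessarily the exact bytes GemmaFormatter emits, and `get_weather` is a made-up example:

```python
import json

# Hypothetical function definition in the usual JSON-schema style;
# the exact on-the-wire framing is the formatter's concern.
get_weather = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
        },
        "required": ["city"],
    },
}

# A model's tool call then comes back as JSON you can parse and dispatch.
tool_call = json.loads('{"name": "get_weather", "arguments": {"city": "Paris"}}')
print(tool_call["name"], tool_call["arguments"]["city"])  # get_weather Paris
```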

✅ Tests: A Lot of Them

I went a bit overboard here (no regrets, and thanks Copilot!):

  • 6 model definition tests: making sure all four variants are correctly registered
  • 9 tool-calling tests: validating function calling scenarios with Gemma 4
  • 195 multilingual tests: this one deserves its own section (see below)

All 697 tests pass. ✅

✅ Conversion Scripts: Ready and Waiting

I wrote dedicated Python and PowerShell conversion scripts:

python scripts/convert_gemma4.py --model-size e2b --output-dir ./models/gemma4-e2b

They’re ready. They just need a runtime that can handle Gemma 4. Which brings me to…
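For the curious, the argument surface of that entry point looks roughly like this. It’s a skeleton reconstructed from the command above: the extra `--model-size` choices are my assumption, and the real script obviously does much more than parse flags:

```python
import argparse

def build_parser():
    """Flag surface matching the command shown above (a sketch; the real
    script hands off to the ONNX conversion toolchain after this)."""
    parser = argparse.ArgumentParser(description="Convert Gemma 4 to ONNX")
    parser.add_argument("--model-size",
                        choices=["e2b", "e4b", "26b-a4b", "31b"],
                        required=True,
                        help="Which Gemma 4 variant to convert")
    parser.add_argument("--output-dir", required=True,
                        help="Where to write the converted ONNX model")
    return parser

args = build_parser().parse_args(
    ["--model-size", "e2b", "--output-dir", "./models/gemma4-e2b"])
print(args.model_size, args.output_dir)  # e2b ./models/gemma4-e2b
```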


โณ The Honest Part: ONNX Conversion Is Blocked ๐Ÿ˜”

OK, here’s where I hit a wall. The ONNX conversion doesn’t work yet.
(I may be missing something here, but hey, it’s a long weekend!)

What’s the Problem?

Gemma 4 has three architectural features that onnxruntime-genai v0.12.2 simply doesn’t support:

  1. Per-Layer Embeddings (PLE): each layer needs a separate per_layer_inputs tensor. The runtime expects one embedding output. Not three dozen.
  2. Variable Head Dimensions: sliding attention layers use head_dim=256, full attention layers (every 5th one) use 512. The runtime config only has ONE head_size field. Pick one? Yeah, no.
  3. KV Cache Sharing: 35 layers share only 15 unique KV cache pairs. The runtime expects a 1:1 mapping. Math doesn’t math.
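Of those three, the variable head dimensions are the easiest to see in code. Here’s a quick Python sketch of the layout described above (the counts come straight from the post; the exact layer indexing convention is my guess):

```python
NUM_LAYERS = 35  # from the KV-cache point above

def head_dim(layer_idx):
    """Every 5th layer is full attention (head_dim=512); the rest are
    sliding attention (head_dim=256). Indexing convention assumed."""
    return 512 if (layer_idx + 1) % 5 == 0 else 256

dims = [head_dim(i) for i in range(NUM_LAYERS)]
print(dims.count(256), "sliding layers,", dims.count(512), "full layers")
# A single head_size config field cannot describe two different values:
assert len(set(dims)) == 2
```

That’s 28 sliding layers and 7 full ones in a single stack, which is exactly the shape a one-value `head_size` field has no way to express.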

What I Tried (The Fun Part)

Here’s my adventure:

  • 🔧 Patched the GenAI builder to route Gemma 4 through the Gemma 3 pipeline. It actually produced a 1.6GB ONNX file! But then the runtime choked with a shape mismatch at the full attention layers. So close.
  • 🔍 Examined the onnx-community models: they have the right structure, but the I/O format is incompatible with GenAI’s KV cache management.
  • 🧪 Tried loading as Gemma4ForCausalLM: nope, weights are stored under a multimodal prefix. Mismatch everywhere.
  • 🔎 Searched for pre-release builds: nothing. 0.12.2 is the latest.
  • 📋 Checked GitHub issues/PRs: zero Gemma 4 mentions in the repo.

So When Will It Work?

The moment onnxruntime-genai adds Gemma 4 support, I’m ready to go:

  • Model definitions ✅
  • Chat template ✅
  • Tests ✅
  • Conversion scripts ✅
  • Documentation ✅

I’m watching: microsoft/onnxruntime-genai releases


Bonus: I Went Multilingual

While I was in testing mode, I figured: why not make sure all my formatters handle every language properly? So I added 195 multilingual tests covering:

| Script/Language | Examples |
|---|---|
| CJK | 日本語, 中文, 한국어 |
| Cyrillic | Русский |
| Arabic | العربية (RTL) |
| Hebrew | עברית (RTL) |
| Devanagari | हिन्दी |
| Tamil | தமிழ் |
| Thai | ไทย |
| European | Ñ, Ü, Ø, Ž, Ą |
| Emoji | 🤖, 👋, 🌍 |
| Zero-width | ZWJ, ZWNJ characters |

All 7 formatters (ChatML, Phi3, Llama3, Qwen, Mistral, Gemma, DeepSeek) handle Unicode correctly. If you’re running models locally, you probably care about this. I know I do.
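The property those tests pin down can be sketched outside the library in a few lines: whatever the chat template wraps around a message, the Unicode payload must come through byte-for-byte untouched. A toy version (my illustration, not the test suite itself):

```python
# A chat template must pass Unicode payloads through verbatim,
# including RTL text and emoji glued together with zero-width joiners.
samples = [
    "日本語のテスト",   # CJK
    "مرحبا بالعالم",    # Arabic (RTL)
    "👩‍💻 coding",      # emoji sequence containing a ZWJ (U+200D)
]

def wrap_user_turn(text):
    """Minimal Gemma-style wrapper; the payload must survive as-is."""
    return f"<start_of_turn>user\n{text}<end_of_turn>\n"

for s in samples:
    wrapped = wrap_user_turn(s)
    assert s in wrapped  # content preserved verbatim
    assert wrapped.encode("utf-8").decode("utf-8") == wrapped  # lossless

assert "\u200d" in samples[2]  # the ZWJ really is in the emoji
print("all samples survived")
```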


Try It Out

Grab v0.8.0:

dotnet add package ElBruno.LocalLLMs --version 0.8.0

Gemma 4 ONNX models aren’t ready yet, but there are 25+ other models that work right now:

using ElBruno.LocalLLMs;
using Microsoft.Extensions.AI;

// Gemma 2 works great today
var options = new LocalLLMsOptions { Model = KnownModels.Gemma2_2BIT };
using var client = await LocalChatClient.CreateAsync(options);
var response = await client.GetResponseAsync([
    new(ChatRole.User, "Tell me about Gemma 4!")
]);
Console.WriteLine(response.Text);

Links

Happy coding!

Greetings

El Bruno

More posts in my blog ElBruno.com.

More info in https://beacons.ai/elbruno


4 responses to “🌟 Gemma 4 Is Here - And My C# Library Is (Almost) Ready”

  1. diomedes1905:

    Can I use it as a local AI agent for code assistance, or should I look in another direction?

    1. Yes, that’s one of the scenarios. Use any of the supported local LLMs to support a local agent for code, or similar.

      Phi-4-mini-instruct is already supported, and I can try to add Qwen2.5-Coder-7B / 7B-Instruct.

      Let me cook something and show how to use it in VSCode.

      1. diomedes1905:

        Yes please, I’m mostly a C# developer and sometimes the online agents give me bad recommendations because they are optimized for JavaScript, Python, and other languages.

