⚠️ This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun and the 🤓 in C# are 100% mine.
Hi 👋
What if you could build a real-time voice conversation app in .NET: speech-to-text, text-to-speech, voice activity detection, and LLM responses, all running locally on your machine?
That’s exactly what ElBruno.Realtime does.
🎥 Watch the full video here (coming soon)
Why I Built This
I’ve been building local AI tools for .NET for a while: local embeddings, local TTS with VibeVoice and QwenTTS, and more. But what was missing was the glue: a framework that chains VAD → STT → LLM → TTS into a single, pluggable pipeline.
I wanted something that:
- Follows Microsoft.Extensions.AI patterns (no proprietary abstractions)
- Uses Dependency Injection like any modern .NET app
- Lets you swap any component: Whisper for STT, Kokoro or QwenTTS for TTS, Foundry Local or Ollama for chat
- Auto-downloads models on first run (no manual setup)
- Supports both one-shot and real-time streaming conversations
So I built it. 🚀
The Architecture
ElBruno.Realtime uses a three-layer architecture:
```
                 Your App
                    │
┌───────────────────────────────────────────┐
│      RealtimeConversationPipeline         │  ← Orchestration layer
│     (Chains VAD → STT → LLM → TTS)        │
└───────────────────────────────────────────┘
     │          │          │          │
   Silero    Whisper     Ollama    Kokoro/Qwen/VibeVoice
    VAD        STT        Chat       TTS
```
Every component implements a standard interface (ISpeechToTextClient from M.E.AI, ITextToSpeechClient, IVoiceActivityDetector, IChatClient), so each one is independently replaceable.
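To get a feel for the pluggable design, here is a rough, self-contained sketch of what implementing and swapping a component looks like. Note this is illustrative only: the interface name, its members, and the threshold-based detection below are made up for the sketch; the real IVoiceActivityDetector contract in ElBruno.Realtime may look different.

```csharp
using System;

// Hypothetical, simplified stand-in for the library's VAD abstraction.
public interface ISimpleVoiceActivityDetector
{
    bool IsSpeech(ReadOnlySpan<float> samples);
}

// A toy energy-based detector: flags a frame as speech when its
// root-mean-square energy exceeds a fixed threshold.
public sealed class EnergyVad : ISimpleVoiceActivityDetector
{
    private readonly float _threshold;

    public EnergyVad(float threshold = 0.01f) => _threshold = threshold;

    public bool IsSpeech(ReadOnlySpan<float> samples)
    {
        double sum = 0;
        foreach (var s in samples) sum += s * s;
        var rms = Math.Sqrt(sum / Math.Max(1, samples.Length));
        return rms > _threshold;
    }
}
```

With a design like this, swapping detectors is just a different DI registration, e.g. `services.AddSingleton<ISimpleVoiceActivityDetector, EnergyVad>();`.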
Two processing modes:
- ProcessTurnAsync (one-shot): give it a WAV file, get back transcription + AI response + audio
- ConverseAsync (streaming): pipe live microphone audio, get real-time events as IAsyncEnumerable&lt;ConversationEvent&gt;
NuGet Packages
| Package | What it does |
|---|---|
| ElBruno.Realtime | Core pipeline + abstractions |
| ElBruno.Realtime.Whisper | Whisper.net STT (GGML models) |
| ElBruno.Realtime.SileroVad | Silero VAD via ONNX Runtime |
| ElBruno.KokoroTTS.Realtime | Kokoro-82M TTS (~320 MB, fast) |
| ElBruno.QwenTTS.Realtime | QwenTTS (~5.5 GB, high quality) |
| ElBruno.VibeVoiceTTS.Realtime | VibeVoice TTS (~1.5 GB) |
All models auto-download on first use. No manual steps. 📦
Show Me the Code
Minimal Console App: One-Shot Conversation
This is the simplest possible setup. Record a question, get an AI response with audio:
```csharp
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.AI;

var services = new ServiceCollection();

// Wire up the pipeline
services.AddPersonaPlexRealtime(opts =>
{
    opts.DefaultSystemPrompt = "You are a helpful assistant. Keep responses brief.";
    opts.DefaultLanguage = "en-US";
})
.UseWhisperStt("whisper-tiny.en")   // 75 MB model, auto-downloaded
.UseSileroVad()                     // ~2 MB model
.UseKokoroTts();                    // ~320 MB model

// Add any IChatClient; here we use Ollama
services.AddChatClient(
    new OllamaChatClient(new Uri("http://localhost:11434"), "phi4-mini"));

var provider = services.BuildServiceProvider();
var conversation = provider.GetRequiredService<IRealtimeConversationClient>();

// Process a WAV file
using var audio = File.OpenRead("question.wav");
var turn = await conversation.ProcessTurnAsync(audio);

Console.WriteLine($"🗣 You said: {turn.UserText}");
Console.WriteLine($"🤖 AI replied: {turn.ResponseText}");
Console.WriteLine($"⏱️ Processing time: {turn.ProcessingTime.TotalMilliseconds:F0}ms");
```
That’s it. First run downloads models automatically. After that, everything runs locally.
Real-Time Streaming: Live Microphone
For real-time conversations, ConverseAsync gives you an IAsyncEnumerable<ConversationEvent> that streams events as they happen:
```csharp
await foreach (var evt in conversation.ConverseAsync(
    microphoneAudioStream,
    new ConversationOptions
    {
        SystemPrompt = "You are a friendly voice assistant.",
        SessionId = "user-123",        // Per-user conversation history
        EnableBargeIn = true,          // Allow interrupting
        MaxConversationHistory = 20,
    }))
{
    switch (evt.Kind)
    {
        case ConversationEventKind.SpeechDetected:
            Console.WriteLine("🎤 Speech detected...");
            break;
        case ConversationEventKind.TranscriptionComplete:
            Console.WriteLine($"🗣 You: {evt.TranscribedText}");
            break;
        case ConversationEventKind.ResponseTextChunk:
            Console.Write(evt.ResponseText); // Streams token by token
            break;
        case ConversationEventKind.ResponseAudioChunk:
            // Play audio chunk in real-time
            audioPlayer.EnqueueChunk(evt.ResponseAudio);
            break;
        case ConversationEventKind.ResponseComplete:
            Console.WriteLine("\n✅ Response complete");
            break;
    }
}
```
The pipeline handles everything:
- Silero VAD detects when you start/stop speaking
- Whisper transcribes your speech
- Ollama generates a response (streamed)
- Kokoro/QwenTTS converts the response to audio (streamed)
All async. All streaming. All local.
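The chained, streaming flow above can be sketched with plain IAsyncEnumerable composition. This is a toy stand-in, not the library's actual internals: each made-up stage consumes the previous stage's async stream, which is what keeps the pipeline end-to-end streaming instead of buffering whole turns.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

// Toy stand-ins for the four stages. Each one wraps the previous
// stage's stream, so items flow through as soon as they are produced.
static async IAsyncEnumerable<string> DetectUtterances(IAsyncEnumerable<byte[]> audio)
{
    await foreach (var chunk in audio)
        if (chunk.Length > 0) yield return "utterance"; // VAD: only pass frames with audio
}

static async IAsyncEnumerable<string> Transcribe(IAsyncEnumerable<string> utterances)
{
    await foreach (var _ in utterances) yield return "hello"; // STT: fake transcription
}

static async IAsyncEnumerable<string> Respond(IAsyncEnumerable<string> texts)
{
    await foreach (var t in texts) yield return $"You said: {t}"; // LLM: fake reply
}

static async IAsyncEnumerable<byte[]> Synthesize(IAsyncEnumerable<string> replies)
{
    await foreach (var r in replies) yield return Encoding.UTF8.GetBytes(r); // TTS: fake audio
}
```

Composing them reads just like the diagram: `Synthesize(Respond(Transcribe(DetectUtterances(micStream))))`.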
ASP.NET Core API + SignalR
Want to expose this as a web API? Here’s the setup:
```csharp
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddPersonaPlexRealtime(opts =>
{
    opts.DefaultSystemPrompt = "You are a helpful assistant.";
})
.UseWhisperStt("whisper-tiny.en")
.UseSileroVad()
.UseKokoroTts();

builder.Services.AddChatClient(
    new OllamaChatClient(new Uri("http://localhost:11434"), "phi4-mini"));

builder.Services.AddSignalR();

var app = builder.Build();

// REST endpoint for one-shot turns
app.MapPost("/api/conversation/turn", async (
    HttpRequest request,
    IRealtimeConversationClient conversation) =>
{
    var form = await request.ReadFormAsync();
    var audioFile = form.Files["audio"];
    using var audioStream = audioFile!.OpenReadStream();

    var turn = await conversation.ProcessTurnAsync(audioStream);

    return Results.Ok(new
    {
        userText = turn.UserText,
        responseText = turn.ResponseText,
        processingTimeMs = turn.ProcessingTime.TotalMilliseconds,
    });
});

app.Run();
```
And a SignalR hub for real-time streaming:
```csharp
public class ConversationHub : Hub
{
    private readonly IRealtimeConversationClient _conversation;

    public ConversationHub(IRealtimeConversationClient conversation)
        => _conversation = conversation;

    public async IAsyncEnumerable<ConversationEventDto> StreamConversation(
        IAsyncEnumerable<byte[]> audioChunks,
        string? systemPrompt = null)
    {
        await foreach (var evt in _conversation.ConverseAsync(
            audioChunks,
            new ConversationOptions { SystemPrompt = systemPrompt }))
        {
            yield return new ConversationEventDto
            {
                Kind = evt.Kind.ToString(),
                TranscribedText = evt.TranscribedText,
                ResponseText = evt.ResponseText,
                Timestamp = evt.Timestamp,
            };
        }
    }
}
```
Swap TTS Engines in One Line
One of the things I love about this design is that changing the TTS engine is literally one line:
```csharp
// Option 1: Kokoro (fast, ~320 MB)
.UseKokoroTts(defaultVoice: "af_heart")

// Option 2: QwenTTS (high quality, ~5.5 GB)
.UseQwenTts()

// Option 3: VibeVoice (balanced, ~1.5 GB)
.UseVibeVoiceTts(defaultVoice: "Carter")
```
Same goes for STT: switch from the tiny to the base model for better accuracy:
```csharp
// Fast (75 MB)
.UseWhisperStt("whisper-tiny.en")

// More accurate (142 MB)
.UseWhisperStt("whisper-base.en")
```
Models: All Auto-Downloaded
No manual model management. The first run may take a moment to download the models; after that, everything is cached locally:
| Model | Size | Purpose |
|---|---|---|
| Silero VAD v5 | ~2 MB | Detect when you’re speaking |
| Whisper tiny.en | ~75 MB | Fast speech-to-text |
| Whisper base.en | ~142 MB | Accurate speech-to-text |
| Kokoro-82M | ~320 MB | Fast text-to-speech |
| VibeVoice | ~1.5 GB | Balanced text-to-speech |
| QwenTTS | ~5.5 GB | High-quality text-to-speech |
| Phi4-Mini (Ollama) | ~2.7 GB | LLM chat (manual: `ollama pull phi4-mini`) |
Models are cached at %LOCALAPPDATA%/ElBruno/Realtime/.
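If you want to see what is actually on disk, the cache folder resolves cross-platform through Environment.SpecialFolder.LocalApplicationData. This is a small sketch that assumes the ElBruno/Realtime subfolder naming from above:

```csharp
using System;
using System.IO;

// Resolve %LOCALAPPDATA%/ElBruno/Realtime cross-platform
// (on Linux/macOS, LocalApplicationData maps to the user's local data dir).
var cacheDir = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
    "ElBruno", "Realtime");

Console.WriteLine($"Model cache: {cacheDir}");
Console.WriteLine(Directory.Exists(cacheDir)
    ? $"Cached files: {Directory.GetFiles(cacheDir, "*", SearchOption.AllDirectories).Length}"
    : "No models downloaded yet.");
```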
Per-User Sessions
The framework includes built-in conversation history with per-user session management:
```csharp
var turn = await conversation.ProcessTurnAsync(
    audioStream,
    new ConversationOptions
    {
        SessionId = "user-456",        // Each user gets their own history
        MaxConversationHistory = 50,   // Sliding window
        SystemPrompt = "You remember context from our previous messages.",
    });
```
InMemoryConversationSessionStore is the default, or you can inject your own IConversationSessionStore for Redis, a database, etc.
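For a sense of what a custom store involves, here is a minimal, self-contained sketch of per-session sliding-window history. It is illustrative only; the real IConversationSessionStore contract may declare different members, so treat the class and method names below as made up:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Toy per-user store: each session id keeps its own history,
// trimmed to the newest _maxHistory entries (the "sliding window").
public sealed class SlidingWindowSessionStore
{
    private readonly ConcurrentDictionary<string, List<string>> _sessions = new();
    private readonly int _maxHistory;

    public SlidingWindowSessionStore(int maxHistory = 50) => _maxHistory = maxHistory;

    public void Append(string sessionId, string message)
    {
        var history = _sessions.GetOrAdd(sessionId, _ => new List<string>());
        lock (history)
        {
            history.Add(message);
            // Drop the oldest entries once the window overflows.
            if (history.Count > _maxHistory)
                history.RemoveRange(0, history.Count - _maxHistory);
        }
    }

    public IReadOnlyList<string> GetHistory(string sessionId) =>
        _sessions.TryGetValue(sessionId, out var h) ? h.ToArray() : Array.Empty<string>();
}
```

A Redis- or database-backed version would keep the same shape and just swap the dictionary for external storage.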
What’s Next
I have a few things on my mind:
- More STT engines (faster-whisper, Azure Speech)
- WebRTC transport for browser-to-server streaming
- .NET Aspire integration sample (scenario-03 is already in progress!)
- Performance benchmarks across TTS engines
- Full support for Foundry Local
Resources
- 📦 GitHub repo: https://github.com/elbruno/ElBruno.Realtime
- 📦 NuGet: ElBruno.Realtime
- 📖 Microsoft.Extensions.AI: Official docs
- 🎙️ Related post: Local AI Voices in .NET - VibeVoice & Qwen TTS
Happy coding!
Greetings
El Bruno
More posts on my blog, ElBruno.com.
More info at https://beacons.ai/elbruno
