โš ๏ธ This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun and the ๐Ÿค– in C# are 100% mine.

Hi 👋

What if you could build a real-time voice conversation app in .NET – speech-to-text, text-to-speech, voice activity detection, and LLM responses – all running locally on your machine?

That’s exactly what ElBruno.Realtime does.

🎥 Watch the full video here (coming soon)

Why I Built This

I’ve been building local AI tools for .NET for a while – local embeddings, local TTS with VibeVoice and QwenTTS, and more. But the glue was missing: a framework that chains VAD → STT → LLM → TTS into a single, pluggable pipeline.

I wanted something that:

  • Follows Microsoft.Extensions.AI patterns (no proprietary abstractions)
  • Uses Dependency Injection like any modern .NET app
  • Lets you swap any component – Whisper for STT, Kokoro or QwenTTS for TTS, Foundry Local or Ollama for chat
  • Auto-downloads models on first run – no manual setup
  • Supports both one-shot and real-time streaming conversations

So I built it. 🚀

The Architecture

ElBruno.Realtime uses a three-layer architecture:

Your App
   ↓
┌────────────────────────────────────┐
│  RealtimeConversationPipeline      │ ← Orchestration Layer
│  (Chains VAD → STT → LLM → TTS)    │
└────────────────────────────────────┘
   ↓         ↓        ↓        ↓
 Silero   Whisper   Ollama   Kokoro/Qwen/VibeVoice
  VAD       STT      Chat     TTS

Every component implements a standard interface – ISpeechToTextClient (from M.E.AI), ITextToSpeechClient, IVoiceActivityDetector, IChatClient – so they’re independently replaceable.

Two processing modes:

  • ProcessTurnAsync – One-shot: give it a WAV file, get back transcription + AI response + audio
  • ConverseAsync – Streaming: pipe live microphone audio, get real-time events as IAsyncEnumerable<ConversationEvent>

NuGet Packages

Package                        What it does
ElBruno.Realtime               Core pipeline + abstractions
ElBruno.Realtime.Whisper       Whisper.net STT (GGML models)
ElBruno.Realtime.SileroVad     Silero VAD via ONNX Runtime
ElBruno.KokoroTTS.Realtime     Kokoro-82M TTS (~320 MB, fast)
ElBruno.QwenTTS.Realtime       QwenTTS (~5.5 GB, high quality)
ElBruno.VibeVoiceTTS.Realtime  VibeVoice TTS (~1.5 GB)

All models auto-download on first use. No manual steps. 📦

Show Me the Code

Minimal Console App โ€” One-Shot Conversation

This is the simplest possible setup. Record a question, get an AI response with audio:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.AI;

var services = new ServiceCollection();

// Wire up the pipeline
services.AddPersonaPlexRealtime(opts =>
{
    opts.DefaultSystemPrompt = "You are a helpful assistant. Keep responses brief.";
    opts.DefaultLanguage = "en-US";
})
.UseWhisperStt("whisper-tiny.en")  // 75 MB model, auto-downloaded
.UseSileroVad()                    // ~2 MB model
.UseKokoroTts();                   // ~320 MB model

// Add any IChatClient – here we use Ollama
services.AddChatClient(
    new OllamaChatClient(new Uri("http://localhost:11434"), "phi4-mini"));

var provider = services.BuildServiceProvider();
var conversation = provider.GetRequiredService<IRealtimeConversationClient>();

// Process a WAV file
using var audio = File.OpenRead("question.wav");
var turn = await conversation.ProcessTurnAsync(audio);

Console.WriteLine($"📝 You said: {turn.UserText}");
Console.WriteLine($"🤖 AI replied: {turn.ResponseText}");
Console.WriteLine($"⏱️ Processing time: {turn.ProcessingTime.TotalMilliseconds:F0}ms");

That’s it. The first run downloads the models automatically; after that, everything runs locally.

Real-Time Streaming – Live Microphone

For real-time conversations, ConverseAsync gives you an IAsyncEnumerable<ConversationEvent> that streams events as they happen:

await foreach (var evt in conversation.ConverseAsync(
    microphoneAudioStream,
    new ConversationOptions
    {
        SystemPrompt = "You are a friendly voice assistant.",
        SessionId = "user-123",       // Per-user conversation history
        EnableBargeIn = true,         // Allow interrupting
        MaxConversationHistory = 20,
    }))
{
    switch (evt.Kind)
    {
        case ConversationEventKind.SpeechDetected:
            Console.WriteLine("🎤 Speech detected...");
            break;
        case ConversationEventKind.TranscriptionComplete:
            Console.WriteLine($"📝 You: {evt.TranscribedText}");
            break;
        case ConversationEventKind.ResponseTextChunk:
            Console.Write(evt.ResponseText); // Streams token by token
            break;
        case ConversationEventKind.ResponseAudioChunk:
            // Play audio chunk in real-time
            audioPlayer.EnqueueChunk(evt.ResponseAudio);
            break;
        case ConversationEventKind.ResponseComplete:
            Console.WriteLine("\n✅ Response complete");
            break;
    }
}

The pipeline handles everything:

  1. Silero VAD detects when you start/stop speaking
  2. Whisper transcribes your speech
  3. Ollama generates a response (streamed)
  4. Kokoro/QwenTTS converts the response to audio (streamed)

All async. All streaming. All local.
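Where does microphoneAudioStream come from? ConverseAsync just wants a Stream of raw PCM, so any capture API works. Here is a minimal sketch using NAudio (my choice for this illustration – the library doesn't mandate it), with a Pipe bridging NAudio's event-driven capture callback to the stream the pipeline reads:

```csharp
using System;
using System.IO;
using System.IO.Pipelines;
using NAudio.Wave; // NAudio NuGet package – an assumption, not part of ElBruno.Realtime

var pipe = new Pipe();

// 16 kHz, 16-bit, mono PCM is the usual input format for Whisper and Silero VAD.
var waveIn = new WaveInEvent { WaveFormat = new WaveFormat(16000, 16, 1) };
waveIn.DataAvailable += async (_, e) =>
{
    // Push each captured buffer into the pipe; the pipeline reads the other end.
    await pipe.Writer.WriteAsync(e.Buffer.AsMemory(0, e.BytesRecorded));
};
waveIn.StartRecording();

// Hand this to conversation.ConverseAsync(...) as the microphone stream.
Stream microphoneAudioStream = pipe.Reader.AsStream();
```

The Pipe keeps capture and processing decoupled: NAudio writes on its own callback thread while ConverseAsync reads at its own pace.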

ASP.NET Core API + SignalR

Want to expose this as a web API? Here’s the setup:

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddPersonaPlexRealtime(opts =>
{
    opts.DefaultSystemPrompt = "You are a helpful assistant.";
})
.UseWhisperStt("whisper-tiny.en")
.UseSileroVad()
.UseKokoroTts();

builder.Services.AddChatClient(
    new OllamaChatClient(new Uri("http://localhost:11434"), "phi4-mini"));
builder.Services.AddSignalR();

var app = builder.Build();

// REST endpoint for one-shot turns
app.MapPost("/api/conversation/turn", async (
    HttpRequest request,
    IRealtimeConversationClient conversation) =>
{
    var form = await request.ReadFormAsync();
    var audioFile = form.Files["audio"];
    using var audioStream = audioFile!.OpenReadStream();

    var turn = await conversation.ProcessTurnAsync(audioStream);

    return Results.Ok(new
    {
        userText = turn.UserText,
        responseText = turn.ResponseText,
        processingTimeMs = turn.ProcessingTime.TotalMilliseconds,
    });
});

app.Run();
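For a quick smoke test of that endpoint, a plain HttpClient call does the job. The form field name "audio" must match what the endpoint reads from the form; the localhost:5000 base address is an assumption – use whatever port your app listens on:

```csharp
using System;
using System.IO;
using System.Net.Http;
using System.Net.Http.Headers;

using var http = new HttpClient { BaseAddress = new Uri("http://localhost:5000") };

using var form = new MultipartFormDataContent();
var audio = new ByteArrayContent(await File.ReadAllBytesAsync("question.wav"));
audio.Headers.ContentType = new MediaTypeHeaderValue("audio/wav");
form.Add(audio, "audio", "question.wav"); // "audio" matches form.Files["audio"]

var response = await http.PostAsync("/api/conversation/turn", form);
response.EnsureSuccessStatusCode();
Console.WriteLine(await response.Content.ReadAsStringAsync()); // JSON with userText/responseText
```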

And a SignalR hub for real-time streaming:

public class ConversationHub : Hub
{
    private readonly IRealtimeConversationClient _conversation;

    public ConversationHub(IRealtimeConversationClient conversation)
        => _conversation = conversation;

    public async IAsyncEnumerable<ConversationEventDto> StreamConversation(
        IAsyncEnumerable<byte[]> audioChunks,
        string? systemPrompt = null)
    {
        await foreach (var evt in _conversation.ConverseAsync(
            audioChunks,
            new ConversationOptions { SystemPrompt = systemPrompt }))
        {
            yield return new ConversationEventDto
            {
                Kind = evt.Kind.ToString(),
                TranscribedText = evt.TranscribedText,
                ResponseText = evt.ResponseText,
                Timestamp = evt.Timestamp,
            };
        }
    }
}
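On the client side, Microsoft.AspNetCore.SignalR.Client handles both directions: the IAsyncEnumerable<byte[]> argument becomes a client-to-server stream, and StreamAsync<T> reads the hub method's returned stream back. A sketch – the /hubs/conversation route is an assumption, since the post doesn't show the MapHub call:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.AspNetCore.SignalR.Client;

var connection = new HubConnectionBuilder()
    .WithUrl("http://localhost:5000/hubs/conversation") // assumed hub route
    .Build();
await connection.StartAsync();

// Stream audio up, consume conversation events down.
await foreach (var evt in connection.StreamAsync<ConversationEventDto>(
    "StreamConversation", MicrophoneChunks(), "You are a friendly voice assistant."))
{
    Console.WriteLine($"{evt.Kind}: {evt.ResponseText ?? evt.TranscribedText}");
}

static async IAsyncEnumerable<byte[]> MicrophoneChunks()
{
    await Task.CompletedTask;
    yield break; // plug captured PCM chunks in here
}
```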

Swap TTS Engines in One Line

One of the things I love about this design – changing the TTS engine is literally one line:

// Option 1: Kokoro – fast, ~320 MB
.UseKokoroTts(defaultVoice: "af_heart")

// Option 2: QwenTTS – high quality, ~5.5 GB
.UseQwenTts()

// Option 3: VibeVoice – balanced, ~1.5 GB
.UseVibeVoiceTts(defaultVoice: "Carter")

Same goes for STT – switch from the tiny to the base model for better accuracy:

// Fast (75 MB)
.UseWhisperStt("whisper-tiny.en")
// More accurate (142 MB)
.UseWhisperStt("whisper-base.en")

Models – All Auto-Downloaded

No manual model management. The first run may take a moment to download; after that, everything is cached locally:

Model               Size     Purpose
Silero VAD v5       ~2 MB    Detect when you’re speaking
Whisper tiny.en     ~75 MB   Fast speech-to-text
Whisper base.en     ~142 MB  Accurate speech-to-text
Kokoro-82M          ~320 MB  Fast text-to-speech
VibeVoice           ~1.5 GB  Balanced text-to-speech
QwenTTS             ~5.5 GB  High-quality text-to-speech
Phi4-Mini (Ollama)  ~2.7 GB  LLM chat (manual: ollama pull phi4-mini)

Models are cached at %LOCALAPPDATA%/ElBruno/Realtime/.
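That path is just the standard LocalApplicationData folder, so you can inspect the cache from code too – the same lookup resolves to the platform equivalent on Linux and macOS. A small sketch:

```csharp
using System;
using System.IO;

var cacheRoot = Path.Combine(
    Environment.GetFolderPath(Environment.SpecialFolder.LocalApplicationData),
    "ElBruno", "Realtime");

// List downloaded models with their sizes – handy before reclaiming disk space.
if (Directory.Exists(cacheRoot))
    foreach (var file in Directory.EnumerateFiles(cacheRoot, "*", SearchOption.AllDirectories))
        Console.WriteLine($"{file} ({new FileInfo(file).Length / (1024 * 1024)} MB)");
```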

Per-User Sessions

The framework includes built-in conversation history with per-user session management:

var turn = await conversation.ProcessTurnAsync(
    audioStream,
    new ConversationOptions
    {
        SessionId = "user-456",       // Each user gets their own history
        MaxConversationHistory = 50,  // Sliding window
        SystemPrompt = "You remember context from our previous messages.",
    });

InMemoryConversationSessionStore is the default – or inject your own IConversationSessionStore for Redis, a database, etc.
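Swapping the store is ordinary DI. The post names IConversationSessionStore but not its members, so RedisConversationSessionStore below is a hypothetical class you would write against whatever contract the package actually exposes:

```csharp
// Hypothetical: your own class implementing IConversationSessionStore,
// e.g. backed by Redis or a database. Register it before building the
// provider and the pipeline uses it instead of the in-memory default.
services.AddSingleton<IConversationSessionStore, RedisConversationSessionStore>();
```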

What’s Next

I have a few things on my mind:

  • More STT engines (faster-whisper, Azure Speech)
  • WebRTC transport for browser-to-server streaming
  • .NET Aspire integration sample (scenario-03 is already in progress!)
  • Performance benchmarks across TTS engines
  • Full support for Foundry Local

Resources

Happy coding!

Greetings

El Bruno

More posts in my blog ElBruno.com.

More info in https://beacons.ai/elbruno

