⚠️ This blog post was created with the help of AI tools. Yes, I used a bit of magic from language models to organize my thoughts and automate the boring parts, but the geeky fun and the 🤖 in C# are 100% mine.

TL;DR

This was the final ElBruno.NetAgent experiment: GitHub Copilot CLI + SQUAD using Azure OpenAI GPT-5.5 BYOK against the same app-building challenge I previously tried with CPU-only local models, GPU local models, and GPT-5-mini.

The good news: GPT-5.5 was clearly better at staying inside phase boundaries, following safety rules, reducing broad stabilization loops, and working with strict quality gates. Once I switched to economy-mode prompts, the flow became much more disciplined: usually one implementation agent, one reviewer, short loops, and no wild repo-wide rewrites.

The not-so-funny news: better models do not remove the need for engineering discipline. Build green, tests green, and smoke tests green still missed a real manual UX bug: the Settings UI looked like it saved Auto Mode, but the persisted config did not actually keep the value. The fix was not “more tests”; it was the right test: one that exercised the real WPF user path, not only the ViewModel.

Also: GPT-5.5 BYOK is powerful, but not cheap. By the time the app reached final automated validation, the run was already past 15 million tokens and roughly $30 in Azure OpenAI usage. SQUAD is useful, but every agent has a context tax.

The biggest lesson: the code is still cheap. The validation strategy is the expensive part.


The Setup

Main repo:

Previous experiments:

  1. CPU-only local model: https://elbruno.com/2026/05/03/running-github-copilot-cli-offline-with-local-models-a-cpu-only-reality-check/
  2. GPU local model: https://elbruno.com/2026/05/06/running-github-copilot-cli-offline-with-local-models-gpu-edition/
  3. GPT-5-mini BYOK: https://elbruno.com/2026/05/11/github-copilot-cli-gpt-5-mini-byok-the-code-was-cheap-the-quality-gates-were-expensive/

This time the goal was simple:

Run the same ElBruno.NetAgent app-building experiment using GitHub Copilot CLI + SQUAD + Azure OpenAI GPT-5.5 BYOK and compare stabilization loops against GPT-5-mini.

The stack:

  • GitHub Copilot CLI
  • SQUAD orchestration
  • Azure OpenAI GPT-5.5 BYOK
  • .NET + WPF
  • Windows tray app
  • Strict quality gates
  • Manual UX validation

Step 0: Define “Working” Before Writing Code

One of the biggest lessons from the GPT-5-mini run was that “build green” was nowhere near enough.

So before writing any product code, the experiment started with a dedicated Step 0 phase:

  • Normalize SQUAD routing to GPT-5.5
  • Define team rules
  • Define safety guardrails
  • Define quality gates
  • Define manual UX validation
  • Prevent uncontrolled multi-agent fan-out

This turned out to be one of the smartest decisions in the whole experiment.

The model was better, but the stronger engineering structure mattered even more.


The Phase-Based Build

Instead of asking the model to “build the app,” the project was split into very strict phases:

  • Phase 1: Minimal .NET solution skeleton
  • Phase 2: Core options, services, DI, safe defaults
  • Phase 3: WPF shell and smoke-testable windows
  • Phase 4: Editable Settings behavior
  • Phase 5: Read-only Network Selector and dry-run logic
  • Phase 6: Tray shell, Help/About, clean exit
  • Phase 7: Status view and product polish
  • Phase 8: Final automated validation
  • Phase 9: Manual UX bug fixing

This worked much better than broad “vibe coding.”

GPT-5.5 was noticeably better at respecting:

  • phase boundaries
  • “do not implement” instructions
  • safety rules
  • dry-run requirements
  • deterministic testing constraints

Economy Mode Was the Real Unlock

Early SQUAD runs were expensive and noisy.

Multiple agents reading large repo context repeatedly created a huge token tax.

The biggest improvement came after switching to what I started calling “economy mode”:

  • one implementation agent
  • one reviewer
  • no Scribe unless needed
  • no doc rewrites
  • no template rewrites
  • read only files needed for the phase
  • small targeted edits
  • stop after the phase
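
For illustration, an economy-mode phase prompt ended up looking roughly like this (a reconstructed sketch built from the rules above, not the exact wording I used):

Phase 4 only: editable Settings behavior.
Use one implementation agent and one reviewer. No Scribe.
Read only the files needed for this phase.
Make small, targeted edits. Do not rewrite docs or templates.
Do not implement later phases. Stop when the phase gate passes.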

This dramatically reduced chaos and stabilization loops.

Ironically, the better the model became, the more important project-management-style prompting became.


The Cost Checkpoint

By the time the app reached final automated validation:

  • ~15.5M tokens consumed
  • 400+ requests
  • ~$30 estimated Azure OpenAI cost

Most of the usage was input/context tokens, not output tokens.
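
To see why, here's a rough back-of-envelope with purely illustrative numbers (not measured values from this run): if three agents each re-read about 40K tokens of repo context across 40 turns, that alone is 3 × 40,000 × 40 ≈ 4.8M input tokens before a single line of code is generated. Context re-reads dominate the bill long before output does.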

That’s an important lesson:

SQUAD adds value, but every agent carries context overhead.

GPT-5.5 BYOK is powerful, but it is definitely not “free vibe coding.”


The Quality Gates Worked… Until They Didn’t

The automated gate looked solid:

dotnet clean .\ElBruno.NetAgent.sln
dotnet build .\ElBruno.NetAgent.sln -c Release
dotnet test .\ElBruno.NetAgent.sln -c Release --no-build --settings disable-parallel.runsettings
dotnet run --project src\ElBruno.NetAgent -- --smoke-test
dotnet run --project src\ElBruno.NetAgent -- --smoke-test-exit
dotnet run --project src\ElBruno.NetAgent -- --smoke-test-settings-save
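
For readers who haven't used this pattern: a smoke-test flag typically short-circuits normal startup, constructs the real UI once, and exits with a status code instead of entering the message loop. Here is a minimal sketch of that idea, with assumed names (MainWindow, RunNormally), not the repo's actual code:

using System;
using System.Windows;

public static class Program
{
    [STAThread]
    public static int Main(string[] args)
    {
        if (Array.IndexOf(args, "--smoke-test") >= 0)
        {
            var app = new Application { ShutdownMode = ShutdownMode.OnExplicitShutdown };
            int exitCode = 1;
            app.Startup += (_, _) =>
            {
                try
                {
                    // The real main window must construct and show without throwing.
                    var window = new MainWindow();
                    window.Show();
                    window.Close();
                    exitCode = 0;
                }
                catch
                {
                    exitCode = 1;
                }
                app.Shutdown();
            };
            app.Run();
            return exitCode;
        }

        return RunNormally(args); // hypothetical normal startup path
    }
}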

And honestly, GPT-5.5 handled this gate surprisingly well.

The project reached:

  • 39+ passing tests
  • fast deterministic runs
  • smoke-test validation
  • WPF window construction validation
  • tray exit validation
  • forbidden-command safety scans

But then manual UX validation found a real bug.


The Manual UX Bug That Mattered

Manual testing showed:

  • tray menu looked OK
  • Help worked
  • Exit worked
  • Settings UI looked rough
  • Save did not persist Auto Mode

At first this was confusing because:

  • tests were green
  • smoke tests were green
  • SettingsViewModel tests passed

The problem was deeper: the tests validated the ViewModel, but not the real user path.

The real failure path was:

UI checkbox
→ Save and Close button
→ persisted config file
→ reload
→ UI state

The fix was not “more tests.”

The fix was:

  • testing the actual WPF interaction path
  • validating persisted config values
  • validating reload behavior
  • validating the same SaveAndCloseAsync() path used by the real button
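
To make that concrete, here is a minimal sketch of the kind of test that would have caught the bug. SettingsViewModel and SaveAndCloseAsync() are the names from this project; the constructor shape, the AutoMode property, and the JSON layout are assumptions for illustration:

using System;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;
using Xunit;

public class SettingsPersistenceTests
{
    [Fact]
    public async Task SaveAndClose_PersistsAutoMode_AcrossReload()
    {
        var configPath = Path.Combine(Path.GetTempPath(), $"netagent-{Guid.NewGuid():N}.json");

        // Drive the same path the real button uses: set the checkbox-bound
        // property, then call SaveAndCloseAsync(), not a test-only save helper.
        var vm = new SettingsViewModel(configPath) { AutoMode = true };
        await vm.SaveAndCloseAsync();

        // Validate the persisted file directly, not just in-memory state.
        using var doc = JsonDocument.Parse(File.ReadAllText(configPath));
        Assert.True(doc.RootElement.GetProperty("AutoMode").GetBoolean());

        // Validate reload: a fresh ViewModel must read the value back.
        var reloaded = new SettingsViewModel(configPath);
        Assert.True(reloaded.AutoMode);
    }
}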

That was probably the most important lesson of the entire experiment.

The right test is the one that fails for the same reason your user fails.


What GPT-5.5 Did Better Than GPT-5-mini

GPT-5.5 was clearly better at:

  • phase discipline
  • respecting safety boundaries
  • avoiding repo-wide chaos
  • using abstractions/fakes
  • isolating WPF/tray behavior
  • deterministic testing patterns
  • targeted fixes
  • reviewer coordination

The tray + exit phase was especially interesting.

GPT-5-mini had previously struggled around:

  • NotifyIcon behavior
  • WPF dispatcher behavior
  • tray shutdown
  • lingering processes

GPT-5.5 handled this much more cleanly once the prompts explicitly enforced:

  • fake adapters
  • no native tray tests
  • no real dispatcher loops
  • deterministic test behavior
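
A quick sketch of the fake-adapter idea; the interface shape is hypothetical, not the actual repo abstraction:

using System;

// Hypothetical tray abstraction; the app talks to this instead of NotifyIcon.
public interface ITrayIcon : IDisposable
{
    void Show();
    event EventHandler? ExitRequested;
}

// Test double: no NotifyIcon, no dispatcher loop, fully deterministic.
public sealed class FakeTrayIcon : ITrayIcon
{
    public bool Visible { get; private set; }
    public bool Disposed { get; private set; }
    public event EventHandler? ExitRequested;

    public void Show() => Visible = true;

    // Tests call this to simulate the user clicking Exit in the tray menu.
    public void RequestExit() => ExitRequested?.Invoke(this, EventArgs.Empty);

    public void Dispose()
    {
        Visible = false;
        Disposed = true;
    }
}

A test can call RequestExit() and assert that the app disposed the icon and shut down cleanly, without ever touching a real tray or a real dispatcher.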

But stronger models still did not remove the need for:

  • manual UX testing
  • safety reviews
  • persistence validation
  • engineering checkpoints

What I Would Do Differently Next Time

A few practical lessons:

  • Start with economy mode immediately
  • Use one implementer + one reviewer by default
  • Treat UI-path tests as mandatory
  • Validate persisted config files directly
  • Manual UX validation should happen earlier
  • Smoke tests should validate real user flows
  • Track token usage per phase
  • Keep phases narrow and explicit

Most importantly:

Define what “working” means before generating code.


Final Thoughts

This experiment ended in a much better place than the GPT-5-mini run.

The stabilization loops were smaller. The orchestration was cleaner. The engineering flow was more controlled.

But the biggest lesson did not change.

The hard part was never generating code.

The hard part was:

  • defining quality
  • validating behavior
  • testing real user flows
  • controlling orchestration
  • knowing when “green” was lying

GPT-5.5 made the engineering loop better.

It did not eliminate the need for engineering discipline.

And that is probably the most important AI-assisted development lesson I’ve learned so far:

Today, the code is cheap.
The decisions, validation, and engineering discipline are still expensive.

Happy coding!

Greetings

El Bruno

More posts on my blog ElBruno.com.

More info at https://beacons.ai/elbruno

