Building a local AI voice assistant from scratch
I got tired of paying for things I couldn't control.
Not in a paranoid way — practically. Every cloud AI assistant is a service. The pricing changes, the API limits change, the terms change. And the latency is always there, even when it's fast. There's a round trip somewhere, always.
So I built my own.
What it is
VESSEL is a physical AI presence for my workshop — Indigo-Nx, my home development setup in Northampton. It runs on the desktop machine, entirely local, entirely offline. Wake word detection, speech-to-text, a local language model, text-to-speech — all of it on hardware I own, processing data that never leaves the room.
The name fits. It's a container. It holds the intelligence without being the intelligence.
The pipeline
The architecture is straightforward once you accept that you're assembling existing components rather than writing each one yourself:
Microphone
→ OpenWakeWord ("hey jarvis")
→ VAD (voice activity detection)
→ whisper-cli.exe (Whisper STT — local)
→ MQTT → hub
→ Ollama (gemma3:4b — local LLM)
→ MQTT → piper.exe (TTS — local)
→ sounddevice playback
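Each arrow above is an MQTT message between stages, which makes the glue code mostly a topic dispatcher. A minimal sketch of that idea — the topic names (`vessel/stt/text` and friends) are my illustrative assumptions, not VESSEL's actual topics:

```python
# Minimal MQTT-style topic dispatcher, as the hub might use to route
# pipeline messages. Topic names here are assumptions for illustration.
from typing import Callable, Dict


class TopicRouter:
    """Routes incoming messages to handlers by exact topic match."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], None]] = {}

    def on(self, topic: str, handler: Callable[[str], None]) -> None:
        """Register a handler for one topic."""
        self._handlers[topic] = handler

    def dispatch(self, topic: str, payload: str) -> bool:
        """Deliver a payload; returns False for unrouted topics."""
        handler = self._handlers.get(topic)
        if handler is None:
            return False
        handler(payload)
        return True


router = TopicRouter()
router.on("vessel/stt/text", lambda text: print("send to LLM:", text))
router.on("vessel/llm/reply", lambda text: print("send to TTS:", text))
```

In the real system this would sit behind a paho-mqtt `on_message` callback; keeping the routing as a plain function makes each stage testable without a broker.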
Everything communicates over MQTT on localhost. The hub is a FastAPI service that orchestrates state, serves the dashboard, and handles the WebSocket connections for the UI.
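The state the hub orchestrates is small enough to sketch. The state names come from the dashboard (idle, listening, thinking, speaking); the transition rules below are my assumption about how they'd be constrained, not the project's actual code:

```python
# Sketch of the hub's assistant state machine. State names match the
# dashboard; the allowed-transition table is an assumption.
VALID_TRANSITIONS = {
    "idle": {"listening"},               # wake word heard
    "listening": {"thinking", "idle"},   # utterance ended, or timeout
    "thinking": {"speaking", "idle"},    # LLM replied, or was cancelled
    "speaking": {"idle"},                # playback finished
}


class AssistantState:
    def __init__(self) -> None:
        self.state = "idle"

    def transition(self, new: str) -> bool:
        """Apply a transition if legal; returns whether it happened."""
        if new in VALID_TRANSITIONS[self.state]:
            self.state = new
            return True
        return False
```

Rejecting illegal transitions (say, jumping from listening straight to speaking) is what keeps a multi-process pipeline from ending up in impossible states when a stage restarts mid-conversation.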
The dashboard
The dashboard is cyberpunk on purpose. Scanlines, neon glow, monospace fonts, a real-time audio spectrum tied to the assistant's state. It's 1024×600, fixed — built to sit on a secondary display without being touched.
It shows what VESSEL is doing at any given moment: idle, listening, thinking, speaking. It shows sensor data from MQTT (presence detection, distance, temperature). It shows the conversation transcript, system metrics, and a weather widget.
It's the kind of thing that would look at home in a film about someone who actually knows what they're doing.
What it runs on
- Whisper (tiny/small English model) for speech recognition — fast enough that you don't notice the delay
- Ollama running gemma3:4b for the language model — genuinely capable for a 4B parameter model
- Piper with a Norman voice for TTS — warm, fast, local
- OpenWakeWord for wake word detection — "hey jarvis", tuned threshold
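The "tuned threshold" on the wake word is worth a sketch. OpenWakeWord emits a per-frame confidence score; a raw threshold check fires on single noisy frames, so smoothing over a short window plus a re-arm latch helps. The threshold and window values below are illustrative assumptions, not OpenWakeWord defaults:

```python
# Smoothed, latching threshold gate for per-frame wake-word scores.
# Threshold and window size are assumptions, not OpenWakeWord defaults.
from collections import deque


class WakeGate:
    def __init__(self, threshold: float = 0.5, window: int = 3) -> None:
        self.threshold = threshold
        self.scores = deque(maxlen=window)
        self.armed = True  # prevents re-firing while scores stay high

    def update(self, score: float) -> bool:
        """Feed one frame's score; returns True on a fresh detection."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        if avg >= self.threshold and self.armed:
            self.armed = False
            return True
        if avg < self.threshold:
            self.armed = True  # re-arm once scores drop back down
        return False
```

The latch matters more than the smoothing: without it, one "hey jarvis" triggers the pipeline several frames in a row.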
The whole thing starts with a PowerShell script and stops cleanly with another.
What it proved
The architecture works. Local voice AI is viable on consumer hardware. The latency is acceptable — under two seconds from wake word to the start of the response on a mid-range GPU.
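That sub-two-second figure only works because no single stage eats the budget. As a back-of-envelope decomposition — the per-stage numbers below are my illustrative guesses, not measurements from VESSEL:

```python
# Illustrative latency budget for a local voice pipeline, measured to
# first audio out rather than full response. Numbers are assumptions.
STAGES_MS = {
    "wake_word_detect": 50,
    "vad_endpointing": 300,     # trailing-silence wait dominates here
    "whisper_stt": 600,
    "llm_first_token": 700,
    "tts_first_audio": 200,
}


def total_latency_ms(stages: dict) -> int:
    """Sum the per-stage budget to first audible response."""
    return sum(stages.values())


print(total_latency_ms(STAGES_MS))  # 1850 ms under this budget
```

The practical lesson from a breakdown like this: streaming the LLM output into TTS, so speech starts at the first sentence rather than the full reply, is where most of the perceived latency goes away.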
What it didn't have was any real personality. The system prompt was a static string: name, location, style. It forgot everything between sessions. It didn't know what you'd said to it yesterday, let alone last week.
That was the next problem to solve.
The personality fork — persistent memory, mood, intent routing, and Spotify control — is covered in the next post.