Building a local AI voice assistant from scratch
I got tired of paying for things I couldn't control.
Not in a paranoid way — practically. Every cloud AI assistant is a service. The pricing changes, the API limits change, the terms change. And the latency is always there, even when it's fast. There's a round trip somewhere, always.
So I built my own.
What it is
VESSEL is a physical AI presence for my workshop — Indigo-Nx, my home development setup in Northampton. It runs on the desktop machine, entirely local, entirely offline. Wake word detection, speech-to-text, a local language model, text-to-speech — all of it on hardware I own, processing data that never leaves the room.
The name fits. It's a container. It holds the intelligence without being the intelligence.
The pipeline
The architecture is straightforward once you accept that you're assembling existing components rather than writing each one yourself:
Microphone
→ OpenWakeWord ("hey jarvis")
→ VAD (voice activity detection)
→ whisper-cli.exe (Whisper STT — local)
→ MQTT → hub
→ Ollama (gemma3:4b — local LLM)
→ MQTT → piper.exe (TTS — local)
→ sounddevice playback
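Each arrow above is an MQTT message between stages, which makes the glue code mostly a topic dispatcher. A minimal sketch of that idea — the topic names (`vessel/stt/text` and friends) are my illustrative assumptions, not VESSEL's actual topics:

```python
# Minimal MQTT-style topic dispatcher, as the hub might use to route
# pipeline messages. Topic names here are assumptions for illustration.
from typing import Callable, Dict


class TopicRouter:
    """Routes incoming messages to handlers by exact topic match."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], None]] = {}

    def on(self, topic: str, handler: Callable[[str], None]) -> None:
        """Register a handler for one topic."""
        self._handlers[topic] = handler

    def dispatch(self, topic: str, payload: str) -> bool:
        """Deliver a payload; returns False for unrouted topics."""
        handler = self._handlers.get(topic)
        if handler is None:
            return False
        handler(payload)
        return True


router = TopicRouter()
router.on("vessel/stt/text", lambda text: print("send to LLM:", text))
router.on("vessel/llm/reply", lambda text: print("send to TTS:", text))
```

In the real system this would sit behind a paho-mqtt `on_message` callback; keeping the routing as a plain function makes each stage testable without a broker.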
Everything communicates over MQTT on localhost. The hub is a FastAPI service that orchestrates state, serves the dashboard, and handles the WebSocket connections for the UI.
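The state the hub orchestrates is small enough to sketch. The state names come from the dashboard (idle, listening, thinking, speaking); the transition rules below are my assumption about how they'd be constrained, not the project's actual code:

```python
# Sketch of the hub's assistant state machine. State names match the
# dashboard; the allowed-transition table is an assumption.
VALID_TRANSITIONS = {
    "idle": {"listening"},               # wake word heard
    "listening": {"thinking", "idle"},   # utterance ended, or timeout
    "thinking": {"speaking", "idle"},    # LLM replied, or was cancelled
    "speaking": {"idle"},                # playback finished
}


class AssistantState:
    def __init__(self) -> None:
        self.state = "idle"

    def transition(self, new: str) -> bool:
        """Apply a transition if legal; returns whether it happened."""
        if new in VALID_TRANSITIONS[self.state]:
            self.state = new
            return True
        return False
```

Rejecting illegal transitions (say, jumping from listening straight to speaking) is what keeps a multi-process pipeline from ending up in impossible states when a stage restarts mid-conversation.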
The dashboard
The dashboard is cyberpunk on purpose. Scanlines, neon glow, monospace fonts, a real-time audio spectrum tied to the assistant's state. It's 1024×600, fixed — built to sit on a secondary display without being touched.
It shows what VESSEL is doing at any given moment: idle, listening, thinking, speaking. It shows sensor data from MQTT (presence detection, distance, temperature). It shows the conversation transcript, system metrics, and a weather widget.
It's the kind of thing that would look at home in a film about someone who actually knows what they're doing.
What it runs on
- Whisper (tiny/small English model) for speech recognition — fast enough that you don't notice the delay
- Ollama running gemma3:4b for the language model — genuinely capable for a 4B parameter model
- Piper with a Norman voice for TTS — warm, fast, local
- OpenWakeWord for wake word detection — "hey jarvis", tuned threshold
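The "tuned threshold" on the wake word is worth a sketch. OpenWakeWord emits a per-frame confidence score; a raw threshold check fires on single noisy frames, so smoothing over a short window plus a re-arm latch helps. The threshold and window values below are illustrative assumptions, not OpenWakeWord defaults:

```python
# Smoothed, latching threshold gate for per-frame wake-word scores.
# Threshold and window size are assumptions, not OpenWakeWord defaults.
from collections import deque


class WakeGate:
    def __init__(self, threshold: float = 0.5, window: int = 3) -> None:
        self.threshold = threshold
        self.scores = deque(maxlen=window)
        self.armed = True  # prevents re-firing while scores stay high

    def update(self, score: float) -> bool:
        """Feed one frame's score; returns True on a fresh detection."""
        self.scores.append(score)
        avg = sum(self.scores) / len(self.scores)
        if avg >= self.threshold and self.armed:
            self.armed = False
            return True
        if avg < self.threshold:
            self.armed = True  # re-arm once scores drop back down
        return False
```

The latch matters more than the smoothing: without it, one "hey jarvis" triggers the pipeline several frames in a row.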
The whole thing starts with a PowerShell script and stops cleanly with another.
What it proved
The architecture works. Local voice AI is viable on consumer hardware. The latency is acceptable — under two seconds from wake word to the start of the response on a mid-range GPU.
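That sub-two-second figure only works because no single stage eats the budget. As a back-of-envelope decomposition — the per-stage numbers below are my illustrative guesses, not measurements from VESSEL:

```python
# Illustrative latency budget for a local voice pipeline, measured to
# first audio out rather than full response. Numbers are assumptions.
STAGES_MS = {
    "wake_word_detect": 50,
    "vad_endpointing": 300,     # trailing-silence wait dominates here
    "whisper_stt": 600,
    "llm_first_token": 700,
    "tts_first_audio": 200,
}


def total_latency_ms(stages: dict) -> int:
    """Sum the per-stage budget to first audible response."""
    return sum(stages.values())


print(total_latency_ms(STAGES_MS))  # 1850 ms under this budget
```

The practical lesson from a breakdown like this: streaming the LLM output into TTS, so speech starts at the first sentence rather than the full reply, is where most of the perceived latency goes away.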
What it didn't have was any real personality. The system prompt was a static string: name, location, style. It forgot everything between sessions. It didn't know what you'd said to it yesterday, let alone last week.
That was the next problem to solve.
The personality fork — persistent memory, mood, intent routing, and Spotify control — is covered in the next post.