V1 of My Local Voice Assistant: Chaining Tools Toward Speed

For a while now, I’ve been thinking about the idea of a voice assistant that doesn’t live in the cloud. Something fast, privacy-respecting, and fully offline. I have yet to use one that works at a truly conversational pace.

So I built one. Or rather, I built version one, and while it works, it’s not fast yet.

This first version is just about learning and experimenting: I was able to record audio, transcribe it, run it through a local LLM, and speak the response back—all without leaving my machine. But now I need to go back and optimize.


The Goal: Speedy, Local, Conversational AI

The core goal was simple:

Speak to my computer → Get an intelligent, spoken response → Do it all fast.

In reality, I hit a few performance roadblocks—but the structure is there, and the tools are all local. I wanted to practice chaining together multiple AI tools to build something semi-cohesive and voice-based.


🛠️ What's Under the Hood?

Each component runs offline, stitched together with Python and shell scripts.


Local LLM with LLaMA.cpp

I used tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf, a tiny quantized model served by llama.cpp. It’s not the smartest LLM around, but it runs fast-ish on CPU and supports chat-like completions over a local server. The priority here was getting something running, so it was a quick decision.

You send a JSON payload to the locally running server, and it returns a reply.
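For illustration, here’s a minimal sketch of that request in Python. It assumes the llama.cpp server is listening on its default local port (8080) and exposes the /completion endpoint, and it uses TinyLlama’s chat template; adjust the endpoint and prompt format to match your setup.

```python
import requests

# Hypothetical helper: ask the locally running llama.cpp server for a reply.
# Assumes the server is listening on 127.0.0.1:8080 and exposes /completion.
def ask_llm(prompt: str) -> str:
    payload = {
        # TinyLlama uses a Zephyr-style chat template; adjust if your model differs.
        "prompt": f"<|user|>\n{prompt}</s>\n<|assistant|>\n",
        "n_predict": 128,     # cap the reply length to keep latency down
        "temperature": 0.7,
    }
    r = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=60)
    r.raise_for_status()
    return r.json()["content"].strip()
```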


Voice In, Words Out – Whisper.cpp

Whisper handles audio transcription. I record 5 seconds of audio using arecord, then pass that WAV file to Whisper.

It spits out a txt file with the transcript, which becomes the prompt for the LLM.
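As a rough sketch, the record-and-transcribe step looks something like this; the whisper.cpp binary and model paths are assumptions, so point them at your own build.

```python
import subprocess

# Assumed locations of the whisper.cpp binary and model; adjust to your setup.
WHISPER_BIN = "./whisper.cpp/main"
WHISPER_MODEL = "./whisper.cpp/models/ggml-base.en.bin"

def record_and_transcribe(wav_path: str = "input.wav") -> str:
    # Record 5 seconds of 16 kHz mono audio (the format whisper.cpp expects).
    subprocess.run(
        ["arecord", "-d", "5", "-r", "16000", "-c", "1", "-f", "S16_LE", wav_path],
        check=True,
    )
    # Transcribe; -otxt writes the transcript next to the WAV as <wav_path>.txt.
    subprocess.run(
        [WHISPER_BIN, "-m", WHISPER_MODEL, "-f", wav_path, "-otxt"],
        check=True,
    )
    with open(f"{wav_path}.txt") as f:
        return f.read().strip()
```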

It’s decently fast and very accurate, but still introduces a few seconds of delay.


Text to Speech with TTS (Tacotron2)

Once I get the LLM’s reply, I pass it to Coqui’s TTS library, which converts it into a WAV file.
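Here’s a minimal sketch of that step. The model name is the standard LJSpeech Tacotron2 checkpoint from Coqui; substitute whichever model you actually have installed.

```python
from TTS.api import TTS

# Load the Tacotron2 model once; reloading it per request would add even more latency.
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC", progress_bar=False)

def synthesize(text: str, wav_path: str = "reply.wav") -> str:
    # Synthesizing the entire reply to disk before playback is the main
    # source of the latency described below.
    tts.tts_to_file(text=text, file_path=wav_path)
    return wav_path
```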

And here’s one of my first critiques: this is slow. Synthesizing the full audio into a WAV file introduces noticeable delay, especially for longer outputs. Responses take around 5 seconds for me, which is not terrible, but it isn’t conversational pace either.


The Loop

Here’s how the assistant works in a loop:

  1. Record 5s of mic audio
  2. Transcribe with Whisper
  3. Send prompt to LLaMA server
  4. Convert response to speech
  5. Play response aloud

It’s satisfying when it works—feels like magic—but the loop takes several seconds to complete.
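Stitched together in Python, the loop is roughly this. It’s a sketch that assumes the record_and_transcribe, ask_llm, and synthesize helpers sketched above, plus aplay for playback.

```python
import subprocess

def main() -> None:
    while True:
        prompt = record_and_transcribe()            # steps 1–2: record mic audio, transcribe it
        if not prompt:
            continue                                # heard nothing, listen again
        reply = ask_llm(prompt)                     # step 3: query the local LLaMA server
        wav = synthesize(reply)                     # step 4: synthesize the reply to a WAV
        subprocess.run(["aplay", wav], check=True)  # step 5: play the response aloud

if __name__ == "__main__":
    main()
```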


Known Issues and Limitations (v1)

  • Hallucinations – TinyLLaMA sometimes goes off-topic or invents facts. Response quality is inconsistent.
  • TTS latency – Synthesizing and playing full WAVs creates lag. Ideally, audio would stream back progressively.
  • Not actually fast – Despite my goal, it takes ~5–10 seconds per round trip, depending on the length of the input and output.
  • Static timing – Audio recording is fixed to 5 seconds. Would prefer it to stop when I stop speaking.

🚀 V2 Goals

I’m thinking about the next version. Here’s what I want to focus on:

  • Stream audio output instead of generating a full WAV file first.
  • Smarter prompt formatting to reduce hallucinations.
  • Faster model – either a better quantization or swap in a bigger one with GPU acceleration.
  • Voice activity detection for smarter recording.
  • Interactive back-and-forth — respond while listening.

Continued Ideas

Linking this in with the knowledge worker code would be great for having a conversation with a model fine-tuned on specific data.

🧱 Build It Yourself

Check out the README.md for instructions: https://github.com/CodeJonesW/local-voice-assistant