For a while now, I’ve been thinking about the idea of a voice assistant that doesn’t live in the cloud. Something fast, privacy-respecting, and fully offline. I have yet to use one that works at a truly conversational pace.
So I built one. Or rather, I built version one, and while it works, it’s not fast yet.
This first version is just about learning and experimenting: I was able to record audio, transcribe it, run it through a local LLM, and speak the response back—all without leaving my machine. But now I need to go back and optimize.
The core goal was simple:
Speak to my computer → Get an intelligent, spoken response → Do it all fast.
In reality, I hit a few performance roadblocks—but the structure is there, and the tools are all local. I wanted to practice chaining together multiple AI tools to build something semi-cohesive and voice-based.
Each component runs offline, stitched together with Python and shell scripts.
I used tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf, a tiny quantized model served by llama.cpp. It’s not the smartest LLM around, but it runs fast-ish on CPU and supports chat-like completions over a local server. The priority here was getting something running, so it was a quick decision.
You send a JSON payload to the locally running server, and it returns a reply.
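The request itself is just a small JSON body. Here’s a minimal sketch, assuming the server is running on llama.cpp’s default port 8080 and exposing its `/completion` endpoint; the prompt template and parameters are illustrative:

```python
# Minimal sketch of a request to a locally running llama.cpp server.
# Assumes the default port 8080 and the /completion endpoint.
import requests

def ask_llm(prompt: str) -> str:
    payload = {
        "prompt": f"User: {prompt}\nAssistant:",  # simple chat-style template
        "n_predict": 128,                          # cap the reply length
        "temperature": 0.7,
    }
    r = requests.post("http://127.0.0.1:8080/completion", json=payload, timeout=60)
    r.raise_for_status()
    return r.json()["content"].strip()
```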
Whisper handles audio transcription. I record 5 seconds of audio using arecord, then pass that WAV file to Whisper.
It spits out a txt file with the transcript, which becomes the prompt for the LLM.
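Roughly, that record-and-transcribe step looks like this. It’s a sketch assuming arecord and the openai-whisper CLI are installed; the file names and flags are illustrative:

```python
# Sketch of the record → transcribe step: arecord captures the mic,
# then the Whisper CLI writes a .txt transcript next to the WAV.
import subprocess
from pathlib import Path

def record_and_transcribe(wav_path: str = "input.wav", seconds: int = 5) -> str:
    # Record a few seconds of 16 kHz mono audio (a format Whisper handles well).
    subprocess.run(
        ["arecord", "-d", str(seconds), "-f", "S16_LE", "-r", "16000", "-c", "1", wav_path],
        check=True,
    )
    # Transcribe; --output_format txt produces input.txt in the current directory.
    subprocess.run(
        ["whisper", wav_path, "--model", "base", "--output_format", "txt", "--output_dir", "."],
        check=True,
    )
    return Path(wav_path).with_suffix(".txt").read_text().strip()
```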
It’s decently fast and very accurate, but still introduces a few seconds of delay.
Once I get the LLM’s reply, I pass it to Coqui’s TTS library, which converts it into a WAV file:
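In code that step is tiny. This is a sketch using Coqui TTS’s Python API; the specific model name is an assumption, and any installed Coqui voice would work:

```python
# Sketch of the text-to-speech step with Coqui TTS.
from TTS.api import TTS

reply_text = "Hello from the local assistant."  # placeholder for the LLM reply
tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")  # assumed model
tts.tts_to_file(text=reply_text, file_path="reply.wav")
```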
And here’s one of my first critiques: this is slow. Synthesizing the full audio into a WAV file introduces noticeable delay, especially for longer outputs. Responses take around 5 seconds for me, which is not terrible but not conversational pace.
Here’s how the assistant works in a loop:
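It’s essentially record → transcribe → ask the LLM → synthesize → play, over and over. A simplified sketch of the loop, reusing the helpers from the sketches above (playback via aplay is my assumption):

```python
# Simplified main loop: listen, think, speak, repeat.
# Uses record_and_transcribe, ask_llm, and tts from the sketches above.
import subprocess

while True:
    prompt = record_and_transcribe()                      # arecord + Whisper
    if not prompt:
        continue                                          # nothing heard, listen again
    reply = ask_llm(prompt)                               # llama.cpp server
    tts.tts_to_file(text=reply, file_path="reply.wav")    # Coqui TTS
    subprocess.run(["aplay", "reply.wav"], check=True)    # speak the reply
```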
It’s satisfying when it works—feels like magic—but the loop takes several seconds to complete.
I’m thinking about the next version. Here’s what I want to focus on:
Linking this in with the knowledge worker code would be great for having a conversation with a model fine-tuned on specific data.
Check out the README.md for instructions > https://github.com/CodeJonesW/local-voice-assistant