I grew up watching Knight Rider. KITT's voice was legendary — instant, intelligent, always listening. When I learned about Picovoice, I realized I could build something that felt like that: offline, fast, and completely private. No Alexa, no cloud processing, no data harvesting. Just a Raspberry Pi that responds when you say "Hey Pi" — and understands what you want.
Why Not the Cloud?
Most voice assistants send your audio to servers. Alexa, Siri, Google — they all need an internet connection, and they all store data (somehow, somewhere, for some reason). I wanted none of that.
Requirements:
- ✓ Offline operation — no internet required after setup
- ✓ Privacy-first — voice data never leaves the Pi
- ✓ Low latency — wake word detection in under 50ms
- ✓ Runs on Pi Zero 2 W — not just beefy hardware
Picovoice checked all those boxes. They provide three key engines:
- Porcupine — ultra-accurate wake word detection (like "Hey Siri" but custom)
- Rhino — intent recognition (what did the user actually ask for?)
- Cheetah — speech-to-text (not needed for this project)
Architecture at a Glance
The system is a pipeline: microphone → Porcupine (wake word) → Rhino (intent) → action handler → espeak-ng (spoken response).
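The control flow boils down to a two-state machine; here's a minimal sketch in pure Python, with the engine results passed in as booleans (the real engine calls come later):

```python
from enum import Enum, auto

class Mode(Enum):
    WAKE = auto()    # listening for "Hey Pi"
    INTENT = auto()  # wake word heard, collecting the command

def step(mode, wake_hit, intent_done):
    """Advance the pipeline one step based on engine results."""
    if mode is Mode.WAKE and wake_hit:
        return Mode.INTENT   # Porcupine fired: hand off to Rhino
    if mode is Mode.INTENT and intent_done:
        return Mode.WAKE     # Rhino finalized: back to listening
    return mode              # nothing happened, keep waiting
```

Everything that follows is filling in those two transitions with real audio.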
Setting Up Picovoice Engines
Picovoice provides Python bindings via pvporcupine and pvrhino. Install with pip:
pip install pvporcupine
pip install pvrhino
You need an access key from Picovoice (free for non-commercial use). Then load the models:
import pvporcupine
import pvrhino
import struct

# Initialize Porcupine (wake word)
porcupine = pvporcupine.create(
    access_key=PICOVOICE_ACCESS_KEY,
    keyword_paths=['hey-pi.ppn']  # Custom trained wake word
)

# Initialize Rhino (intent)
rhino = pvrhino.create(
    access_key=PICOVOICE_ACCESS_KEY,
    context_path='smart-home.rhn'  # Custom intent model
)
Those model files (.ppn and .rhn) are the magic. Picovoice's console lets you train custom wake words and intents without writing audio processing code yourself. I trained "Hey Pi" using 30 samples of my voice (about 10 minutes of recording), then generated the .ppn file. For Rhino, I defined an intent schema:
{
  "intents": [
    {
      "name": "lightControl",
      "samples": [
        "turn on the light",
        "switch off the light",
        "lights please",
        "darken the room"
      ]
    },
    {
      "name": "temperatureQuery",
      "samples": [
        "what's the temperature",
        "how hot is it",
        "how cold is it in here"
      ]
    }
  ]
}
Picovoice trains a neural net from those examples, exports a .rhn file, and you're off to the races.
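Before uploading a schema like this, I run it through a quick sanity check. This tiny validator is my own helper, not part of Picovoice's tooling:

```python
import json

def check_context(text):
    """Return the intent names if the schema is well-formed, else raise."""
    data = json.loads(text)
    names = []
    for intent in data["intents"]:
        if not intent.get("samples"):
            raise ValueError(f"intent {intent['name']!r} has no samples")
        names.append(intent["name"])
    return names
```

It catches the usual copy-paste mistakes (missing samples, trailing commas) before the Console does.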
The Audio Pipeline: I2S to PCM
Getting raw audio from the Pi's microphone isn't trivial. I used a ReSpeaker 2-Mic HAT (an I2S audio interface) and pyaudio to stream 16-bit PCM samples at 16 kHz:
import pyaudio

p = pyaudio.PyAudio()
stream = p.open(
    rate=porcupine.sample_rate,
    format=pyaudio.paInt16,
    channels=1,
    input=True,
    frames_per_buffer=porcupine.frame_length
)

while True:
    pcm = stream.read(porcupine.frame_length)
    pcm_unpacked = struct.unpack_from("h" * porcupine.frame_length, pcm)

    # Send to Porcupine for wake word detection
    keyword_index = porcupine.process(pcm_unpacked)
    if keyword_index >= 0:
        print("Wake word detected!")
        # Now run Rhino on the next N frames for intent
        detect_intent()
That porcupine.process() call returns -1 if no wake word, or >= 0 if it detected your keyword. Sub-100ms CPU time on Pi Zero 2 W. Insane.
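The struct.unpack_from step is just reinterpreting raw bytes as 16-bit samples; here's a self-contained illustration with four hand-packed samples:

```python
import struct

# Pack four little-endian int16 samples, the same wire format
# the sound card delivers through pyaudio.
raw = struct.pack("<4h", 0, 1000, -1000, 32767)

# Unpack them back into Python ints, exactly as the loop above does.
samples = struct.unpack_from("<" + "h" * 4, raw)
print(samples)  # (0, 1000, -1000, 32767)
```

Each frame Porcupine consumes is just a tuple of these ints, frame_length of them at a time.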
Intent Recognition with Rhino
After the wake word fires, I collect the next ~1 second of audio (multiple frames) and feed it to Rhino. Rhino outputs either:
- is_understood = true, plus an intent and slots (like "turn on the light")
- is_understood = false (speech wasn't clear or intent unknown)
def detect_intent():
    while True:
        pcm = stream.read(rhino.frame_length)
        frame = struct.unpack_from("h" * rhino.frame_length, pcm)

        # Rhino consumes one frame at a time and returns True
        # once it has heard enough to make a decision
        if rhino.process(frame):
            break

    intent = rhino.get_inference()
    if intent.is_understood:
        print(f"Intent: {intent.intent}")
        print(f"Slots: {intent.slots}")
        if intent.intent == "lightControl":
            slot = intent.slots.get("state", "on")
            if slot == "on":
                turn_light_on()
            elif slot == "off":
                turn_light_off()
    else:
        print("Didn't understand that one.")
The slot filling is magic. Train Rhino with examples like "turn on the light" and "turn the light off", and it learns to extract "on"/"off" as a state slot. No regex, no keyword matching — just statistical learning on-device.
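As intents accumulate, the if/elif chain gets unwieldy; a dict-based dispatcher keeps it flat. A sketch with stand-in handler names (my own pattern, not a Picovoice API):

```python
def handle_light(slots):
    # Stand-in for turn_light_on()/turn_light_off()
    return f"light {slots.get('state', 'on')}"

def handle_temperature(slots):
    # Stand-in for a sensor read
    return "temperature query"

HANDLERS = {
    "lightControl": handle_light,
    "temperatureQuery": handle_temperature,
}

def dispatch(intent_name, slots):
    """Route a Rhino inference to its handler, with a fallback."""
    handler = HANDLERS.get(intent_name)
    if handler is None:
        return "Didn't understand that one."
    return handler(slots)
```

Adding a new intent then means one new function and one dict entry, not another branch in the hot loop.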
Giving It a Personality
I wanted KITT's voice. Not the actual intellectual property, but the vibe: confident, slightly robotic, helpful.
For TTS (text-to-speech), I used espeak-ng with a custom voice configuration:
import subprocess

def speak(text):
    subprocess.run([
        'espeak-ng',
        '-v', 'en+f3',  # Female voice, 3rd variant
        '-s', '130',    # Speed (words per minute)
        '-p', '70',     # Pitch (0-99)
        '-a', '200',    # Amplitude
        text
    ])

speak("Light is now on.")
en+f3 gives that slightly deeper, clearer timbre that feels like a car's onboard computer. Adjusting -p (pitch) and -s (speed) let me fine-tune it to something that sounded both intelligent and artificial.
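To make the voice easy to retune, the espeak-ng flags can be built from a config dict. A small helper (the defaults mirror the values above; the helper itself is my own):

```python
# Voice profile: same values as the speak() call above
VOICE = {"voice": "en+f3", "speed": 130, "pitch": 70, "amplitude": 200}

def espeak_args(text, cfg=VOICE):
    """Build the espeak-ng argv list from a voice config."""
    return [
        "espeak-ng",
        "-v", cfg["voice"],
        "-s", str(cfg["speed"]),
        "-p", str(cfg["pitch"]),
        "-a", str(cfg["amplitude"]),
        text,
    ]
```

Tweaking the personality is then a one-line config change instead of editing the subprocess call.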
Performance Numbers
On a Raspberry Pi Zero 2 W (1GHz quad-core ARM Cortex-A53, 1GB RAM), the whole pipeline runs comfortably under 15% CPU.
In practice it feels instantaneous: the wake word fires within a fraction of a second of you saying "Hey Pi", and the response is ready before you finish your command.
Challenges & Gotchas
False triggers. At first, "Hey" followed by any P-sound would fire. I retrained the wake word with more negative samples (TV audio, random speech) and lowered Porcupine's sensitivity setting, trading a few missed detections for fewer false alarms. Still a few false positives, but acceptable.
Ambient noise. A kitchen exhaust fan at 3 feet is about 60 dB. Picovoice models are trained on noisy datasets, so they handle it decently, but commands that worked fine indoors were misunderstood outdoors. Retraining with outdoor samples fixed it.
Latency variance. The Pi occasionally gets CPU spikes from other services (Home Assistant, system updates). I added a watchdog that restarts the voice assistant if process() takes over 200ms.
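The watchdog is just a timing guard around the hot call. A simplified sketch — the hypothetical on_timeout callback stands in for the real service restart:

```python
import time

def guarded_process(process_fn, frame, limit_s=0.2, on_timeout=lambda: None):
    """Run one engine call and flag it if it blows the latency budget."""
    start = time.perf_counter()
    result = process_fn(frame)
    elapsed = time.perf_counter() - start
    if elapsed > limit_s:
        on_timeout()  # e.g. log and restart the assistant service
    return result
```

In my setup the callback just asks systemd to restart the unit; the engines reload their models in a couple of seconds.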
What's Next
I've got the basics working: lights on/off, temperature query, "what time is it". Next steps:
- → Integrate with Home Assistant via REST API for full home control
- → Add wake-up alarm with gradual Philips Hue dimming
- → Multi-turn conversations (context awareness)
The coolest part? This all runs on $35 hardware with zero monthly fees. Compare that to an Echo Dot ($50 + Amazon's data harvesting) or HomePod ($300 locked into Apple's ecosystem).