I grew up watching Knight Rider. KITT's voice was legendary — instant, intelligent, always listening. When I learned about Picovoice, I realized I could build something that felt like that: offline, fast, and completely private. No Alexa, no cloud processing, no data harvesting. Just a Raspberry Pi that responds when you say "Hey Pi" — and understands what you want.

Why Not the Cloud?

Most voice assistants send your audio to remote servers. Alexa, Siri, Google — they all need an internet connection, and they all store data (somehow, somewhere, for some reason). I wanted none of that.

Requirements:

  • Offline operation — no internet required after setup
  • Privacy-first — voice data never leaves the Pi
  • Low latency — wake word detection in under 50ms
  • Runs on Pi Zero 2 W — not just beefy hardware

Picovoice checked all those boxes. They provide three key engines:

  1. Porcupine — ultra-accurate wake word detection (like "Hey Siri" but custom)
  2. Rhino — intent recognition (what did the user actually ask for?)
  3. Cheetah — speech-to-text (not needed for this project)

Architecture at a Glance

The system is a pipeline:

  1. Microphone stream — raw I2S/PCM audio
  2. Porcupine wake word — detects "Hey Pi"
  3. Rhino intent engine — understands the command
  4. Execute action — say something, turn the light on

Setting Up Picovoice Engines

Picovoice provides Python bindings via pvporcupine and pvrhino. Install with pip:

pip install pvporcupine
pip install pvrhino

You need an access key from Picovoice (free for non-commercial use). Then load the models:

import pvporcupine
import pvrhino
import struct

# Initialize Porcupine (wake word)
porcupine = pvporcupine.create(
    access_key=PICOVOICE_ACCESS_KEY,
    keyword_paths=['hey-pi.ppn']  # Custom trained wake word
)

# Initialize Rhino (intent)
rhino = pvrhino.create(
    access_key=PICOVOICE_ACCESS_KEY,
    context_path='smart-home.rhn'  # Custom intent model
)

Those model files (.ppn and .rhn) are the magic. Picovoice's console lets you train custom wake words and intents without writing audio processing code yourself. I trained "Hey Pi" using 30 samples of my voice (about 10 minutes of recording), then generated the .ppn file. For Rhino, I defined an intent schema:

{
  "intents": [
    {
      "name": "lightControl",
      "samples": [
        "turn on the light",
        "switch off the light",
        "lights please",
        "darken the room"
      ]
    },
    {
      "name": "temperatureQuery",
      "samples": [
        "what's the temperature",
        "how hot is it",
        "how cold is it in here"
      ]
    }
  ]
}

Picovoice trains a neural net from those examples, exports a .rhn file, and you're off to the races.
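Since the schema is plain JSON, it's easy to sanity-check before uploading it to the console. A minimal sketch — `summarize_schema` is my own helper, not part of Picovoice's tooling:

```python
import json


def summarize_schema(raw: str) -> dict:
    """Parse an intent schema and map each intent name to its sample count."""
    schema = json.loads(raw)
    return {intent["name"]: len(intent["samples"]) for intent in schema["intents"]}


raw = """
{
  "intents": [
    {"name": "lightControl",
     "samples": ["turn on the light", "switch off the light",
                 "lights please", "darken the room"]},
    {"name": "temperatureQuery",
     "samples": ["what's the temperature", "how hot is it",
                 "how cold is it in here"]}
  ]
}
"""

print(summarize_schema(raw))  # → {'lightControl': 4, 'temperatureQuery': 3}
```

A check like this catches a missing comma or a duplicated intent name before you spend a training run on it.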

The Audio Pipeline: I2S to PCM

Getting raw audio from the Pi's microphone isn't trivial. I used a ReSpeaker 2-Mic HAT (an I2S audio interface) and pyaudio to stream 16-bit PCM samples at 16kHz:

import pyaudio

p = pyaudio.PyAudio()

stream = p.open(
    rate=porcupine.sample_rate,
    format=pyaudio.paInt16,
    channels=1,
    input=True,
    frames_per_buffer=porcupine.frame_length
)

while True:
    pcm = stream.read(porcupine.frame_length)
    pcm_unpacked = struct.unpack_from("h" * porcupine.frame_length, pcm)

    # Send to Porcupine for wake word detection
    keyword_index = porcupine.process(pcm_unpacked)

    if keyword_index >= 0:
        print("Wake word detected!")
        # Now run Rhino on the next N frames for intent
        detect_intent()

That porcupine.process() call returns -1 if no wake word, or >= 0 if it detected your keyword. Sub-100ms CPU time on Pi Zero 2 W. Insane.
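That `struct.unpack_from` call is what converts PyAudio's raw byte buffer into the tuple of signed 16-bit samples Porcupine expects. A standalone illustration with synthetic data, no microphone needed (the real `frame_length` is 512; I shrank it here for readability):

```python
import struct

frame_length = 4                    # Porcupine's real frame_length is 512
samples = (0, 1000, -1000, 32767)   # int16 PCM values

# Pack as native-endian 16-bit ints — the same layout PyAudio's
# paInt16 stream hands you as a bytes object
raw = struct.pack("h" * frame_length, *samples)
assert len(raw) == frame_length * 2  # two bytes per sample

unpacked = struct.unpack_from("h" * frame_length, raw)
print(unpacked)  # → (0, 1000, -1000, 32767)
```

The round trip is lossless: every frame is just `frame_length * 2` bytes reinterpreted as integers.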

Intent Recognition with Rhino

After the wake word fires, I keep feeding audio frames to Rhino until it signals that it has heard a complete phrase (the engine handles endpointing itself, typically after about a second of speech). Rhino then reports either:

  • is_understood = true, plus the intent and its slots (like "turn on the light")
  • is_understood = false (the speech wasn't clear or the intent is unknown)

def detect_intent():
    # Feed Rhino one frame at a time; process() returns True once the
    # engine has heard a complete phrase (Rhino handles endpointing)
    while True:
        pcm = stream.read(rhino.frame_length)
        frame = struct.unpack_from("h" * rhino.frame_length, pcm)
        if rhino.process(frame):
            break

    inference = rhino.get_inference()

    if inference.is_understood:
        print(f"Intent: {inference.intent}")
        print(f"Slots: {inference.slots}")

        if inference.intent == "lightControl":
            state = inference.slots.get("state", "on")
            if state == "on":
                turn_light_on()
            elif state == "off":
                turn_light_off()
    else:
        print("Didn't understand that one.")

The slot filling is magic. Train Rhino with examples like "turn on the light" and "turn the light off", and it learns to extract "on"/"off" as a state slot. No regex, no keyword matching — just statistical learning on-device.
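Because intents and slot values come back as plain strings, dispatching them is just a table lookup. A minimal sketch — the handlers here are placeholders for whatever GPIO or Home Assistant calls you actually use:

```python
# Map (intent, slot value) pairs to handlers; the lambdas are
# stand-ins for real actions like toggling a GPIO pin
actions = {
    ("lightControl", "on"): lambda: "light on",
    ("lightControl", "off"): lambda: "light off",
    ("temperatureQuery", None): lambda: "22 degrees",
}


def dispatch(intent: str, slots: dict) -> str:
    """Look up a handler for the recognized intent and its state slot."""
    handler = actions.get((intent, slots.get("state")))
    return handler() if handler else "not understood"


print(dispatch("lightControl", {"state": "on"}))  # → light on
print(dispatch("temperatureQuery", {}))           # → 22 degrees
```

Adding a new command then means one new entry in the table, not another `if/elif` branch.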

Giving It a Personality

I wanted KITT's voice. Not the actual intellectual property, but the vibe: confident, slightly robotic, helpful.

For TTS (text-to-speech), I used espeak-ng with a custom voice configuration:

import subprocess

def speak(text):
    subprocess.run([
        'espeak-ng',
        '-v', 'en+f3',        # Female voice, 3rd variant
        '-s', '130',          # Speed (words per minute)
        '-p', '70',           # Pitch (0-99)
        '-a', '200',          # Amplitude
        text
    ])

speak("Light is now on.")

en+f3 gives that slightly deeper, clearer timbre that feels like a car's onboard computer. Adjusting -p (pitch) and -s (speed) let me fine-tune it to something that sounded both intelligent and artificial.
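One practical tweak while tuning: build the argument list in a separate function so you can inspect and tweak voices without firing the speaker every time. A small sketch whose defaults mirror the values above (`espeak-ng` still has to be on your PATH for the actual call):

```python
import subprocess


def build_espeak_cmd(text, voice="en+f3", speed=130, pitch=70, amplitude=200):
    """Assemble the espeak-ng argument list without executing it."""
    return [
        "espeak-ng",
        "-v", voice,
        "-s", str(speed),
        "-p", str(pitch),
        "-a", str(amplitude),
        text,
    ]


def speak(text):
    subprocess.run(build_espeak_cmd(text))


print(build_espeak_cmd("Light is now on."))
```

Splitting construction from execution also makes the voice settings trivially unit-testable.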

Performance Numbers

On a Raspberry Pi Zero 2 W (1GHz quad-core ARM Cortex-A53, 1GB RAM), the whole pipeline runs comfortably under 15% CPU.

  • Wake word detection: ~50ms from audio to detection
  • Intent recognition: ~200ms from end of speech to intent
  • Total pipeline: ~600ms including TTS speak time

Those numbers feel instantaneous to a human. The wake word fires within a fraction of a second of you saying "Hey Pi", and the response is ready before you finish your command.

Challenges & Gotchas

False triggers. At first, "Hey" followed by any P-sound would fire. I retrained the wake word with more negative samples (TV audio, random speech) and dialed down Porcupine's sensitivity parameter — lower sensitivity means fewer detections, and so fewer false positives. Still a few slip through, but it's acceptable.

Ambient noise. Kitchen exhaust fan at 3 feet = 60dB. Picovoice models are trained on noisy datasets, so they handle it decently, but commands Rhino understood fine indoors started getting misheard outdoors. Retraining the context with outdoor samples fixed it.

Latency variance. The Pi occasionally gets CPU spikes from other services (Home Assistant, system updates). I added a watchdog that restarts the voice assistant if process() takes over 200ms.
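The watchdog idea is simple enough to sketch: time each call with `time.monotonic()` and fire a callback when it blows its budget. The restart hook here is a placeholder — in my setup it would signal whatever supervisor (systemd, a parent process) owns the assistant:

```python
import time


def with_watchdog(fn, budget_s=0.2, on_timeout=lambda: None):
    """Run fn, timing it; invoke on_timeout if it exceeded the budget."""
    start = time.monotonic()
    result = fn()
    if time.monotonic() - start > budget_s:
        on_timeout()  # e.g. ask the supervisor to restart the assistant
    return result


# Example: a deliberately slow "process" call trips the watchdog
tripped = []
with_watchdog(lambda: time.sleep(0.25), budget_s=0.2,
              on_timeout=lambda: tripped.append(True))
print(tripped)  # → [True]
```

Note this flags a slow call after the fact rather than interrupting it — fine here, since the goal is restarting a degraded process, not preempting one call.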

What's Next

I've got the basics working: lights on/off, temperature query, "what time is it". Next steps:

  • Integrate with Home Assistant via REST API for full home control
  • Add wake-up alarm with gradual Philips Hue dimming
  • Multi-turn conversations (context awareness)
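For the Home Assistant piece, its REST API takes a JSON POST with a bearer token. A hedged sketch using only the standard library — the URL, token, and entity ID are placeholders for your own instance:

```python
import json
import urllib.request


def build_ha_request(base_url, token, entity_id):
    """Build (but don't send) a Home Assistant light.turn_on service call."""
    body = json.dumps({"entity_id": entity_id}).encode()
    return urllib.request.Request(
        url=f"{base_url}/api/services/light/turn_on",
        data=body,
        headers={
            "Authorization": f"Bearer {token}",   # long-lived access token
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_ha_request("http://homeassistant.local:8123",
                       "LONG_LIVED_TOKEN", "light.living_room")
print(req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```

Wiring that into the intent handler would replace the placeholder `turn_light_on()` call with a real service invocation.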

The coolest part? This all runs on $35 hardware with zero monthly fees. Compare that to an Echo Dot ($50 + Amazon's data harvesting) or HomePod ($300 locked into Apple's ecosystem).

WANT TO BUILD THIS?
I'm available for IoT and hardware integration projects. Get in touch.