| Component | Version | Notes |
|---|---|---|
| Python | 3.11+ | Windows install from python.org |
| NVIDIA GPU driver | Latest | RTX series recommended |
| CUDA toolkit | 12.x | Required by faster-whisper |
| Mosquitto | 2.x | MQTT broker |
| WhisperLiveKit | Latest | pip install whisperlivekit |
| PlatformIO | Latest | Via VS Code extension |
Download the installer from mosquitto.org and install with default settings. Start the service from an elevated (Administrator) command prompt:
net start mosquitto
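Verify it's running by subscribing to all topics (the command stays open and prints messages as they arrive; press Ctrl+C to exit):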
mosquitto_sub -h localhost -t "#" -v
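If `mosquitto_sub` is not on your PATH, a plain TCP connection test gives a rough equivalent of the check above. This is a minimal sketch: `port_open` is a hypothetical helper name, and 1883 is assumed because it is the standard unencrypted MQTT port. It confirms only that something is listening, not that it speaks MQTT.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Assumption: the broker listens on the default MQTT port 1883.
    print("Mosquitto reachable:", port_open("localhost", 1883))
```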
pip install whisperlivekit
Start the server with diarization enabled:
wlk --model large-v3 --language en --diarization
The first run downloads the model (~3 GB). The WebSocket will be available at
ws://localhost:8000/asr. Verify by opening http://localhost:8000 in a browser.
Latency note: If `large-v3` is too slow on your GPU, try `--model distil-large-v3` for similar accuracy at lower latency.
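The browser check can also be scripted, which is handy while waiting for the first model download to finish. The sketch below assumes only that the server answers a plain HTTP GET at http://localhost:8000 as described above; `http_ok` is a hypothetical helper, and it does not exercise the WebSocket endpoint.

```python
import http.client
from urllib.parse import urlsplit


def http_ok(url: str, timeout: float = 3.0) -> bool:
    """Return True if a GET request to url answers with a 2xx status."""
    parts = urlsplit(url)
    conn = http.client.HTTPConnection(parts.hostname, parts.port or 80, timeout=timeout)
    try:
        conn.request("GET", parts.path or "/")
        return 200 <= conn.getresponse().status < 300
    except (OSError, http.client.HTTPException):
        # Server not started yet, still loading, or not speaking HTTP.
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print("WhisperLiveKit page reachable:", http_ok("http://localhost:8000/"))
```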
cd bridge
pip install -r requirements.txt
Run it:
python bridge.py
A small window opens for assigning friendly names to auto-detected speakers (SPEAKER_00, SPEAKER_01, …). The defaults (Pastor, Reader, Guest, Choir) are applied immediately; edit them if your service has different roles.
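To illustrate the naming step, here is one way such a mapping could work. It is a sketch under assumptions, not the code in `bridge.py`: the function name `friendly_name` is hypothetical, and it assumes speaker IDs follow the `SPEAKER_NN` pattern shown above, with the defaults assigned in order.

```python
DEFAULT_NAMES = ["Pastor", "Reader", "Guest", "Choir"]


def friendly_name(speaker_id: str, names=DEFAULT_NAMES) -> str:
    """Map 'SPEAKER_01' to names[1]; return the raw ID if it cannot be mapped."""
    prefix, _, index = speaker_id.rpartition("_")
    if prefix == "SPEAKER" and index.isdecimal() and int(index) < len(names):
        return names[int(index)]
    # Unknown format or more speakers than names: keep the original ID visible.
    return speaker_id


print(friendly_name("SPEAKER_00"))  # Pastor
print(friendly_name("SPEAKER_07"))  # SPEAKER_07
```

Falling back to the raw ID keeps an unexpected fifth speaker visible on the display instead of silently mislabeling them.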
Open the `esp32/` folder in VS Code with the PlatformIO extension installed.

Edit `src/main.cpp` and fill in your WiFi credentials and the PC's IP address:
#define WIFI_SSID "YourNetwork"
#define WIFI_PASSWORD "YourPassword"
#define MQTT_HOST "192.168.1.100" // run `ipconfig` on the PC to find this
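As an alternative to reading `ipconfig` output, the PC's LAN address can be looked up from Python. This sketch uses a common trick: connecting a UDP socket selects the outgoing interface without sending any packets. `lan_ip` is a hypothetical helper, and on a machine with several network adapters it may report a different address than the one the ESP32 can reach, so treat the result as a hint.

```python
import socket


def lan_ip() -> str:
    """Best-effort guess of this machine's LAN IPv4 address."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # 192.0.2.1 is a reserved documentation address; connect() on a UDP
        # socket sends nothing, it only picks the interface that would be used.
        s.connect(("192.0.2.1", 9))
        return s.getsockname()[0]
    except OSError:
        # No usable route (for example, no network): fall back to loopback.
        return "127.0.0.1"
    finally:
        s.close()


print("Use this for MQTT_HOST:", lan_ip())
```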
Select the correct environment in PlatformIO:
- `esp32dev` for ESP32-WROOM-32
- `esp32-s3` for ESP32-S3 (recommended for larger RAM)

Click Upload. Open Serial Monitor at 115200 baud to see boot messages.
Run these checks in order:
Whisper standalone: speak into the mic and verify that text appears at http://localhost:8000.

MQTT manually: with the ESP32 connected, publish a test message:
mosquitto_pub -h localhost -t display/text -m "{\"lines\":[\"Line one\",\"Line two\",\"Line three\"]}"
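The display should refresh within ~2 seconds. If the ESP32 cannot reach the broker, note that Mosquitto 2.x accepts only local connections unless `mosquitto.conf` defines a listener (for example `listener 1883` together with `allow_anonymous true`, or proper authentication); restart the service after changing the configuration.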
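The escaped quotes in the test command are easy to get wrong, so it can help to build the payload in Python and pass it to `mosquitto_pub` as a single argument. The sketch below assumes only the `{"lines": [...]}` shape and the `display/text` topic shown above; `build_payload` and `publish` are hypothetical helpers, and `publish` requires `mosquitto_pub` on your PATH.

```python
import json
import subprocess


def build_payload(lines) -> str:
    """Serialize display lines into the JSON shape used by the test message."""
    return json.dumps({"lines": list(lines)}, ensure_ascii=False)


def publish(lines, host: str = "localhost", topic: str = "display/text") -> None:
    """Publish via mosquitto_pub; the argument list avoids shell quoting issues."""
    subprocess.run(
        ["mosquitto_pub", "-h", host, "-t", topic, "-m", build_payload(lines)],
        check=True,
    )


print(build_payload(["Line one", "Line two", "Line three"]))
# {"lines": ["Line one", "Line two", "Line three"]}
```

Calling `publish(["Line one", "Line two", "Line three"])` should then have the same effect as the manual command.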
Full pipeline: start the bridge and speak naturally. Text should appear on the display within 3–5 seconds of speech.

Speaker labels: if two people speak alternately, [PASTOR] / [READER] labels should appear as speaker changes are detected.
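The speaker-label behavior can be pictured as follows: a label is emitted only when the speaker differs from the previous segment. This is an illustrative sketch, not the bridge's actual logic; `label_on_change` and the `(speaker, text)` input format are assumptions.

```python
def label_on_change(segments):
    """Prefix a [NAME] label to a line only when the speaker changes.

    segments: iterable of (speaker_name, text) pairs in spoken order.
    """
    lines = []
    previous = None
    for speaker, text in segments:
        if speaker != previous:
            lines.append(f"[{speaker.upper()}] {text}")
            previous = speaker
        else:
            lines.append(text)
    return lines


demo = [
    ("Pastor", "Let us pray."),
    ("Pastor", "Amen."),
    ("Reader", "A reading from the letter."),
]
for line in label_on_change(demo):
    print(line)
# [PASTOR] Let us pray.
# Amen.
# [READER] A reading from the letter.
```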
- Set the Mosquitto service to start automatically (`sc config mosquitto start=auto`)
- … (`.bat` file)