Setup Guide

Prerequisites

| Component         | Version | Notes                           |
|-------------------|---------|---------------------------------|
| Python            | 3.11+   | Windows install from python.org |
| NVIDIA GPU driver | Latest  | RTX series recommended          |
| CUDA toolkit      | 12.x    | Required by faster-whisper      |
| Mosquitto         | 2.x     | MQTT broker                     |
| WhisperLiveKit    | Latest  | pip install whisperlivekit      |
| PlatformIO        | Latest  | Via VS Code extension           |

1 — Install Mosquitto (MQTT broker)

Download from mosquitto.org and install with default settings. Start the service:

net start mosquitto

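Verify it's running by subscribing to all topics. The command stays connected and prints each message as it arrives, so no output simply means nothing has been published yet; an immediate connection error means the broker is not running: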

mosquitto_sub -h localhost -t "#" -v
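
Because a silent subscriber can look the same as a hung one, a quick port check can serve as a second opinion. The sketch below is only an illustration: it assumes the broker listens on the conventional MQTT port 1883 (the usual default for Mosquitto), and the function name is made up for this guide.

```python
import socket


def broker_reachable(host="localhost", port=1883, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds.

    Port 1883 is the conventional unencrypted MQTT port (an assumption;
    adjust it if your mosquitto.conf uses a different listener).
    """
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling broker_reachable() from a Python prompt should return True once the service is started. This only proves that something accepts TCP connections on that port, not that it speaks MQTT.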

2 — Install WhisperLiveKit

pip install whisperlivekit

Start the server with diarization enabled:

wlk --model large-v3 --language en --diarization

The first run downloads the model (~3 GB). The WebSocket will be available at ws://localhost:8000/asr. Verify by opening http://localhost:8000 in a browser.
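
The browser check can also be scripted, which may be handy when the PC runs headless. This is a minimal sketch using only the standard library; it assumes the server answers plain HTTP on the URL given above, and the function name is hypothetical.

```python
import urllib.error
import urllib.request


def server_responds(url="http://localhost:8000", timeout=3.0):
    """Return True if the URL answers with any HTTP response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status.
        return True
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or name resolution failure.
        return False
```

During the first run the function will likely return False until the model download finishes and the server starts listening.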

Latency note: If large-v3 is too slow on your GPU, try --model distil-large-v3 for similar accuracy at lower latency.


3 — Install the Python bridge

cd bridge
pip install -r requirements.txt

Run it:

python bridge.py

A small window opens for assigning friendly names to auto-detected speakers (SPEAKER_00, SPEAKER_01, …). The defaults (Pastor, Reader, Guest, Choir) are applied immediately — edit them if your service has different roles.
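
The renaming step can be pictured as a simple lookup from diarization label to friendly name. The sketch below is not the bridge's actual code. It assumes the four defaults are assigned to SPEAKER_00 through SPEAKER_03 in the order listed, and that a speaker without an assigned name keeps its raw label. All function and variable names are hypothetical.

```python
# Default role names in the order given in this guide (the mapping order
# onto SPEAKER_00, SPEAKER_01, ... is an assumption).
DEFAULT_NAMES = ["Pastor", "Reader", "Guest", "Choir"]


def build_name_map(names=DEFAULT_NAMES):
    """Map SPEAKER_00, SPEAKER_01, ... to the given names, in order."""
    return {f"SPEAKER_{i:02d}": name for i, name in enumerate(names)}


def friendly_name(speaker_id, name_map):
    """Look up a friendly name; unknown speakers keep their raw label."""
    return name_map.get(speaker_id, speaker_id)


def label_for(speaker_id, name_map):
    """Format a bracketed, upper-case tag such as [PASTOR]."""
    return f"[{friendly_name(speaker_id, name_map).upper()}]"
```

With the defaults, label_for("SPEAKER_01", build_name_map()) yields [READER]. Editing a name in the window would correspond to changing one entry in the map.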


4 — Flash the ESP32

  1. Open the esp32/ folder in VS Code with the PlatformIO extension installed.
  2. Edit src/main.cpp — fill in your WiFi credentials and the PC's IP address:

    #define WIFI_SSID     "YourNetwork"
    #define WIFI_PASSWORD "YourPassword"
    #define MQTT_HOST     "192.168.1.100"   // run `ipconfig` on the PC to find this
    
  3. Select the correct environment in PlatformIO:

    • esp32dev for ESP32-WROOM-32
    • esp32-s3 for ESP32-S3 (recommended for larger RAM)
  4. Click Upload. Open Serial Monitor at 115200 baud to see boot messages.
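
As an alternative to reading ipconfig output for the MQTT_HOST value in step 2, the PC's LAN address can be guessed from Python. This is a best-effort sketch: it asks the OS which local interface it would use for outbound traffic, so on a PC with several adapters (VPN, virtual switches) it may pick a different address than the one the ESP32 can reach. The function name and probe address are illustrative.

```python
import socket


def local_ipv4(probe_host="192.0.2.1", probe_port=9):
    """Best-effort guess of this machine's LAN IPv4 address.

    192.0.2.1 is a reserved documentation address; nothing is sent to it.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # connect() on a UDP socket sends no packets; it only makes the OS
        # choose the local interface it would route through.
        sock.connect((probe_host, probe_port))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (e.g. network cable unplugged): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()
```

If the result is 127.0.0.1, the PC has no active network route and the value from ipconfig should be used instead.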


5 — End-to-end test

Run these checks in order:

  1. Whisper standalone — speak into the mic, verify text appears at http://localhost:8000.

  2. MQTT manually — with the ESP32 connected, publish a test message:

    mosquitto_pub -h localhost -t display/text -m "{\"lines\":[\"Line one\",\"Line two\",\"Line three\"]}"
    

     The display should refresh within ~2 seconds.

  3. Full pipeline — start the bridge, speak naturally. Text should appear on the display within 3–5 seconds of speech.

  4. Speaker labels — if two people speak alternately, [PASTOR] / [READER] labels should appear as speaker changes are detected.
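
The manual MQTT check can also be driven from Python, which avoids the quote escaping needed on the command line. The sketch below builds the same JSON shape as the mosquitto_pub example (an object with a "lines" array) and hands it to the mosquitto_pub CLI. It assumes mosquitto_pub is on the PATH; the function names are hypothetical, and nothing here reflects how bridge.py itself publishes.

```python
import json
import subprocess


def build_payload(lines):
    """Serialize display lines to the compact JSON form used in the test message."""
    return json.dumps({"lines": list(lines)}, separators=(",", ":"))


def publish(lines, host="localhost", topic="display/text"):
    """Publish the payload with the mosquitto_pub CLI (must be on PATH)."""
    payload = build_payload(lines)
    # Passing arguments as a list means no shell quoting is required.
    subprocess.run(
        ["mosquitto_pub", "-h", host, "-t", topic, "-m", payload],
        check=True,  # raise CalledProcessError if the publish fails
    )
    return payload
```

For example, publish(["Line one", "Line two", "Line three"]) sends the same message as the command in check 2.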


6 — Deployment checklist

  • PC set to never sleep during services
  • Mosquitto service set to start automatically (sc config mosquitto start= auto, run from an elevated prompt; sc expects the space after start=)
  • WhisperLiveKit added to Windows startup (Task Scheduler or a .bat file)
  • ESP32 powered from a USB wall adapter (not a PC USB port, so the display does not depend on the PC being powered on)
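  • Static IP assigned to the ESP32 in the router's DHCP settings (the PC's address should be fixed as well, since MQTT_HOST is hard-coded in the firmware)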
  • Audio input confirmed — direct mixer feed preferred over microphone