Setup Guide

Prerequisites

Component	Version	Notes
Python	3.11+	Windows install from python.org
NVIDIA GPU driver	Latest	RTX series recommended
CUDA toolkit	12.x	Required by faster-whisper
Mosquitto	2.x	MQTT broker
WhisperLiveKit	Latest	`pip install whisperlivekit`
PlatformIO	Latest	Via VS Code extension

1 — Install Mosquitto (MQTT broker)

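Download the installer from mosquitto.org and install with default settings. Mosquitto 2.x accepts connections only from localhost unless a listener is configured, so for the ESP32 to reach the broker, add `listener 1883` and either `allow_anonymous true` or a password file to mosquitto.conf. Then start the service from an Administrator command prompt: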

net start mosquitto

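Verify it's running by subscribing to all topics. The command should connect without an error and stay open, printing any messages that arrive; press Ctrl+C to exit: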

mosquitto_sub -h localhost -t "#" -v
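
If you would rather script this check (for example, before starting the bridge), a minimal sketch is shown here. It only tests that something accepts TCP connections on the standard MQTT port 1883; it does not speak MQTT. The function name `broker_reachable` is hypothetical and not part of this project.

```python
import socket


def broker_reachable(host: str = "localhost", port: int = 1883,
                     timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: the broker listens on the default MQTT port 1883.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    print("broker reachable:", broker_reachable())
```

Running the same check from another machine with the PC's IP address as `host` is a quick way to confirm the broker is reachable over the network and not only from localhost.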

2 — Install WhisperLiveKit

pip install whisperlivekit

Start the server with diarization enabled:

wlk --model large-v3 --language en --diarization

The first run downloads the model (~3 GB). The WebSocket will be available at ws://localhost:8000/asr. Verify by opening http://localhost:8000 in a browser.

Latency note: If large-v3 is too slow on your GPU, try --model distil-large-v3 for similar accuracy at lower latency.

3 — Install the Python bridge

cd bridge
pip install -r requirements.txt

Run it:

python bridge.py

A small window opens for assigning friendly names to auto-detected speakers (SPEAKER_00, SPEAKER_01, …). The defaults (Pastor, Reader, Guest, Choir) are applied immediately — edit them if your service has different roles.
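
The mapping from auto-detected speaker IDs to display labels can be pictured roughly as follows. This is an illustrative sketch, not the bridge's actual code: the function name `speaker_label` is hypothetical, and the only details taken from this guide are the `SPEAKER_NN` ID format, the four default names, and the bracketed upper-case label style.

```python
# Default friendly names, in the order speakers are detected (from this guide).
DEFAULT_NAMES = ["Pastor", "Reader", "Guest", "Choir"]


def speaker_label(speaker_id: str, names=DEFAULT_NAMES) -> str:
    """Map an ID such as 'SPEAKER_00' to a label such as '[PASTOR]'.

    Falls back to the raw ID when it cannot be parsed or has no assigned name.
    """
    name = speaker_id
    try:
        index = int(speaker_id.rsplit("_", 1)[1])
        if 0 <= index < len(names):
            name = names[index]
    except (IndexError, ValueError):
        pass  # Keep the raw ID as the label text.
    return f"[{name.upper()}]"


if __name__ == "__main__":
    print(speaker_label("SPEAKER_01"))
```

Passing a different `names` list corresponds to editing the defaults in the window for a service with different roles.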

4 — Flash the ESP32

Open the esp32/ folder in VS Code with the PlatformIO extension installed.

Edit src/main.cpp — fill in your WiFi credentials and the PC's IP address:

#define WIFI_SSID     "YourNetwork"
#define WIFI_PASSWORD "YourPassword"
#define MQTT_HOST     "192.168.1.100"   // run `ipconfig` on the PC to find this
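
As an alternative to reading the `ipconfig` output, the PC's LAN address can usually be found with a few lines of Python. This is a sketch under assumptions: the helper name `local_ip` is hypothetical, and on machines with several network adapters or a VPN the address returned may not be the one the ESP32 can reach, so check it against `ipconfig` if in doubt.

```python
import socket


def local_ip() -> str:
    """Return the local IPv4 address used for outbound traffic.

    Connecting a UDP socket sends no packets; it only makes the OS pick
    the outgoing interface. 192.0.2.1 is a reserved documentation address.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.connect(("192.0.2.1", 9))
        return sock.getsockname()[0]
    except OSError:
        # No usable route (for example, no network): fall back to loopback.
        return "127.0.0.1"
    finally:
        sock.close()


if __name__ == "__main__":
    print("Use this value for MQTT_HOST:", local_ip())
```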

Select the correct environment in PlatformIO:
- esp32dev for ESP32-WROOM-32
- esp32-s3 for ESP32-S3 (recommended for larger RAM)

Click Upload. Open Serial Monitor at 115200 baud to see boot messages.

5 — End-to-end test

Run these checks in order:

1. Whisper standalone: speak into the mic and verify that text appears at http://localhost:8000.

2. MQTT manually: with the ESP32 connected, publish a test message:

mosquitto_pub -h localhost -t display/text -m "{\"lines\":[\"Line one\",\"Line two\",\"Line three\"]}"

The display should refresh within ~2 seconds.

3. Full pipeline: start the bridge and speak naturally. Text should appear on the display within 3–5 seconds of speech.

4. Speaker labels: if two people speak alternately, [PASTOR] / [READER] labels should appear as speaker changes are detected.
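
The escaped quotes in the manual MQTT test are easy to get wrong on the command line. One way around that is to build the JSON in Python and pass it to mosquitto_pub as a single argument. The sketch below assumes only what the test message shows: the topic display/text and a JSON object with a "lines" array. The helper names and the `max_lines=3` limit are hypothetical; adjust the limit to whatever your display firmware expects.

```python
import json

TOPIC = "display/text"  # Topic used in the manual test.


def build_payload(lines, max_lines=3):
    """Return the JSON payload, keeping only the most recent max_lines lines.

    Assumption: max_lines=3 mirrors the three-line test message; it is not
    a documented limit of the firmware.
    """
    trimmed = [str(line) for line in lines][-max_lines:]
    return json.dumps({"lines": trimmed})


def mosquitto_pub_args(payload, host="localhost", topic=TOPIC):
    """Build the mosquitto_pub argument list (-h host, -t topic, -m message)."""
    return ["mosquitto_pub", "-h", host, "-t", topic, "-m", payload]


if __name__ == "__main__":
    payload = build_payload(["Line one", "Line two", "Line three"])
    print(payload)
    print(mosquitto_pub_args(payload))
```

Passing the list to `subprocess.run(mosquitto_pub_args(payload), check=True)` publishes the message without any shell quoting, provided mosquitto_pub is on the PATH.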

6 — Deployment checklist

- PC set to never sleep during services
- Mosquitto service set to start automatically (sc config mosquitto start= auto, run as Administrator; sc's documented syntax expects the space after start=)
- WhisperLiveKit added to Windows startup (Task Scheduler or a .bat file)
- ESP32 powered from a USB wall adapter (not a PC USB port, so the display does not depend on the PC for power)
- Static IP (DHCP reservation) assigned to the ESP32 in the router settings
- Audio input confirmed; a direct mixer feed is preferred over a microphone
