This file provides context for AI-assisted development sessions on the Church Live Transcription Display project.
A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and serves a fullscreen display page over the local WiFi network. Any tablet, TV, or browser-capable device on the same network can act as the display. No cloud services. No internet required during operation.
[Audio source]
↓ (USB mic or mixer line-in)
[Windows PC]
├── WhisperLiveKit (port 8000)
│ ├── Whisper large-v3 (transcription)
│ └── diart / pyannote (real-time speaker diarization)
│ WebSocket output: ws://localhost:8000/asr
│
├── Mosquitto MQTT broker (port 1883, internal bus)
│
├── bridge.py
│ ├── Subscribes to Whisper WebSocket
│ ├── Receives: {text, speaker_id, is_final, ...}
│ ├── Resolves speaker_id → name via speakers.json
│ ├── Buffers text to sentence boundary
│ └── Publishes JSON payload to MQTT topic display/text
│
└── admin.py (port 8001)
  ├── Speaker name management (REST API + web UI)
  ├── Per-speaker voice sample library
  ├── Test recording playback
  ├── Subscribes to MQTT display/text
  └── /display (fullscreen display page, SSE push to browsers)
↓ WiFi (browser on local network)
[Any tablet / TV / browser-capable device]
└── http://[PC-IP]:8001/display ← fullscreen display page
Multiple display devices can connect simultaneously.
pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0

ctranslate2 must be pinned to ==4.5.0. Install the following pip packages explicitly (this bundles the CUDA runtime, so no system CUDA Toolkit is required for ctranslate2 itself):
nvidia-cublas-cu12
nvidia-cudnn-cu12
nvidia-cuda-runtime-cu12
setuptools must be <82 to avoid a pkg_resources import error at startup.
Confirmed working on this machine: CUDA 13.2, CUDA devices: 1.
- CUDA version confirmed with nvcc --version; nvidia-smi shows driver 595.79.
- PyTorch is installed from the cu124 index (--index-url https://download.pytorch.org/whl/cu124). The cu124 wheels run on CUDA 13.x drivers because NVIDIA drivers remain backward-compatible with older CUDA runtimes.
- The warning "Failed to launch Triton kernels, likely due to missing CUDA toolkit" is misleading: Triton (the Python package) does not support Windows at all. The fallback to a median kernel is expected and harmless. Under the LocalAgreement backend (current), these timing kernels are not used anyway.

The display is a fullscreen browser page served by admin.py at /display.
- http://[PC-IP]:8001/display: open on any tablet, TV, or spare device on the same WiFi
- admin.py subscribes to MQTT and forwards display/text payloads to connected browsers via SSE
- No microcontroller, firmware, or hardware assembly is required.
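The MQTT-to-SSE forwarding described above could look roughly like the sketch below. It is not the code in admin.py: the `format_sse`, `mqtt_to_sse`, and `Hub` names are hypothetical, and only the topic names and the `text` / `clear` event names come from this document. The framing itself follows the standard Server-Sent Events wire format.

```python
import asyncio


def format_sse(event: str, data: str = "") -> str:
    """Frame one Server-Sent Event; multi-line data gets one data: line each."""
    lines = [f"event: {event}"]
    # An empty data: line is still sent so the browser dispatches the event.
    lines += [f"data: {part}" for part in data.split("\n")]
    return "\n".join(lines) + "\n\n"  # a blank line terminates the event


def mqtt_to_sse(topic: str, payload: bytes) -> str:
    """Map an MQTT message on display/* to an SSE frame."""
    if topic == "display/clear":
        return format_sse("clear")
    return format_sse("text", payload.decode("utf-8"))


class Hub:
    """Hypothetical fan-out: one asyncio.Queue per connected browser."""

    def __init__(self) -> None:
        self.clients = set()

    def subscribe(self) -> asyncio.Queue:
        queue = asyncio.Queue()
        self.clients.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        self.clients.discard(queue)

    def broadcast(self, frame: str) -> None:
        # Must run on the event loop thread; see the note below.
        for queue in self.clients:
            queue.put_nowait(frame)
```

Because the MQTT client delivers messages on its own thread, a callback would presumably hand frames over with `loop.call_soon_threadsafe(hub.broadcast, frame)` rather than calling `broadcast` directly; each SSE response handler would then read from its own queue.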
| Topic | Direction | Payload |
|---|---|---|
| display/text | bridge → admin.py → display browsers | JSON: see schema below |
| display/clear | bridge → admin.py → display browsers | Empty |
{
"lines": [
"PASTOR JOHN",
"...and He said unto them, go",
"into all the world and preach"
]
}
- lines: array of strings, max DISPLAY_LINES items (currently 3); the speaker name is injected as the first line on speaker change
- Text is wrapped at MAX_LINE_CHARS (60) using textwrap.wrap; bridge.py publishes one MQTT message per wrapped line so the display scrolls one line at a time

bridge/bridge.py

Main audio pipeline. Headless (no UI). Uses AudioProcessor + TranscriptionEngine directly (no WebSocket) and publishes to Mosquitto.
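As a rough illustration of the text stage (buffer to a sentence boundary, wrap at MAX_LINE_CHARS, emit one payload per wrapped line), here is a minimal sketch. It is not the code in bridge.py. The `TextStage` class, the punctuation set, the uppercase speaker label, and the idea that each payload carries the rolling window of the last DISPLAY_LINES lines are all assumptions made for illustration; only the constant names and values come from this document.

```python
import json
import textwrap
import time
from typing import List, Optional

MAX_LINE_CHARS = 60              # values taken from the config constants
DISPLAY_LINES = 3
SENTENCE_TIMEOUT = 4.0
SENTENCE_END = (".", "!", "?")   # assumption: what counts as a sentence boundary


class TextStage:
    """Hypothetical stand-in for the buffering and flushing done in bridge.py."""

    def __init__(self) -> None:
        self.pending = ""                          # text not yet flushed
        self.pending_since = 0.0                   # when the pending text started
        self.speaker: Optional[str] = None         # speaker of the pending text
        self.shown_speaker: Optional[str] = None   # last speaker label displayed
        self.window: List[str] = []                # rolling display lines

    def push_final(self, text: str, speaker: str,
                   now: Optional[float] = None) -> List[str]:
        """Add finalized text; return any JSON payloads that are ready to publish."""
        now = time.monotonic() if now is None else now
        out: List[str] = []
        if self.pending and speaker != self.speaker:
            out += self.flush()                    # speaker changed: emit what we have
        if not self.pending:
            self.pending_since = now
        self.speaker = speaker
        self.pending = f"{self.pending} {text}".strip()
        # The timeout is only checked on arrival of new text in this sketch;
        # a real pipeline would also need a timer for silence.
        timed_out = now - self.pending_since >= SENTENCE_TIMEOUT
        if self.pending.endswith(SENTENCE_END) or timed_out:
            out += self.flush()
        return out

    def flush(self) -> List[str]:
        """Wrap the pending text and build one payload per wrapped line."""
        lines = textwrap.wrap(self.pending, MAX_LINE_CHARS)
        self.pending = ""
        if lines and self.speaker != self.shown_speaker:
            lines.insert(0, (self.speaker or "").upper())
            self.shown_speaker = self.speaker
        payloads = []
        for line in lines:
            self.window.append(line)
            del self.window[:-DISPLAY_LINES]       # keep only the newest lines
            payloads.append(json.dumps({"lines": list(self.window)}))
        return payloads
```

Each returned string would then be published to the display/text topic, one message per line, which is what makes the display scroll a line at a time.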
Current state:
- BridgeState class holds all mutable state (thread-safe via threading.Lock)
- speaker_names: dict loaded from speakers.json, polled for changes every 5s via _speaker_reloader()
- push_final(): accumulates text, detects speaker change, flushes on sentence boundary or timeout
- _flush(): word-wraps with textwrap.wrap(text, 60), publishes one MQTT message per line (so the display scrolls one line at a time), injects [SPEAKER NAME] label on speaker change
- _receive_results(): delta-tracks the full concatenated transcript across FrontData.lines to avoid double-counting the growing last segment
- _choose_audio_device(): lists input devices, respects the AUDIO_DEVICE config constant
- sounddevice.InputStream → asyncio queue → audio_processor.process_audio()
- POST /inject accepts raw PCM bytes from admin.py test playback

Config constants (top of file):
- MQTT_HOST, SAMPLE_RATE=16000, BLOCKSIZE=4096
- SENTENCE_TIMEOUT=4.0, MAX_LINE_CHARS=60, DISPLAY_LINES=3
- AUDIO_DEVICE=12 (Logitech BRIO); set to None to use the Windows default

TranscriptionEngine settings:
- backend_policy="localagreement": WhisperStreaming local-agreement algorithm; more accurate than SimulStreaming, ~2s additional latency
- confidence_validation=True: suppresses low-confidence tokens (reduces hallucinations on breath/pause)
- beam_size=5 (hardcoded in FasterWhisperASR)

bridge/admin.py

FastAPI web server on port 8001. Single file: HTML/CSS/JS embedded as a Python string.
Endpoints:
- GET /: speaker management web UI
- GET|POST /api/speakers: list / add speakers
- PUT|DELETE /api/speakers/{sid}: rename / remove speaker
- POST|GET /api/speakers/{sid}/recording: upload / serve per-speaker voice sample
- POST /api/test/upload: upload full-service test recording
- GET /api/test/files: list test recordings
- DELETE /api/test/files/{filename}: delete test recording
- POST /api/test/start: stream test recording to WhisperLiveKit (via _stream_file())
- POST /api/test/stop: cancel active playback
- GET /api/test/status: playback progress / state
- GET /display: fullscreen display page (black background, Georgia serif, 3 rolling lines, speaker header in gold)
- GET /api/display/stream: SSE endpoint; subscribes to MQTT via paho, pushes event: text / event: clear to all connected browsers

Test playback: _stream_file() is an asyncio task that reads audio via miniaudio.stream_file() (handles WAV/MP3/FLAC/OGG/M4A, resamples to 16kHz mono) and POSTs raw PCM chunks to http://127.0.0.1:8002/inject on bridge.py, which queues them ahead of live microphone input.
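The chunk-and-POST step of test playback might be sketched as follows. This is an illustration, not the body of _stream_file(): decoding with miniaudio is left out, and the 16-bit sample width, the 4096-sample chunk size, the real-time pacing, and the Content-Type header are all assumptions. The helper names (`pcm_chunks`, `chunk_duration`, `post_chunk`, `stream_pcm`) are hypothetical; only the sample rate and the /inject URL come from this document.

```python
import time
import urllib.request
from typing import Callable, Iterator

SAMPLE_RATE = 16000        # matches SAMPLE_RATE in bridge.py
BYTES_PER_SAMPLE = 2       # assumption: 16-bit signed mono PCM
INJECT_URL = "http://127.0.0.1:8002/inject"


def pcm_chunks(pcm: bytes, chunk_samples: int = 4096) -> Iterator[bytes]:
    """Split already-decoded PCM into fixed-size chunks (the last may be shorter)."""
    step = chunk_samples * BYTES_PER_SAMPLE
    for start in range(0, len(pcm), step):
        yield pcm[start:start + step]


def chunk_duration(chunk: bytes) -> float:
    """Playback time of one chunk in seconds."""
    return len(chunk) / (BYTES_PER_SAMPLE * SAMPLE_RATE)


def post_chunk(chunk: bytes, url: str = INJECT_URL) -> int:
    """POST one raw PCM chunk; the header choice here is an assumption."""
    request = urllib.request.Request(
        url, data=chunk, method="POST",
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status


def stream_pcm(pcm: bytes, send: Callable[[bytes], object] = post_chunk,
               sleep: Callable[[float], object] = time.sleep,
               chunk_samples: int = 4096) -> int:
    """Send chunks at roughly real-time pace; returns the number of bytes sent."""
    sent = 0
    for chunk in pcm_chunks(pcm, chunk_samples):
        send(chunk)
        sleep(chunk_duration(chunk))   # assumption: pace playback like a live mic
        sent += len(chunk)
    return sent
```

Passing `send` and `sleep` as parameters keeps the pacing logic testable without a running bridge; in an asyncio task the blocking calls would be replaced with their async equivalents.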
bridge/whisper_launcher.py

Startup wrapper for WhisperLiveKit. Applies the ffmpeg PATH fix and torchaudio shim, then calls whisperlivekit.cli.main(). Used by start.bat instead of calling wlk directly.
bridge/speakers.json

Auto-created on first run. Format: {"SPEAKER_00": "Pastor", "SPEAKER_01": "Reader", ...}. Seeded with 4 defaults. Persists across sessions. Written by both bridge.py and admin.py; bridge.py polls the file's mtime every 5s to pick up admin changes.
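The mtime polling and ID-to-name lookup can be sketched like this. The `SpeakerMap` class is hypothetical and is not the code behind _speaker_reloader(); falling back to the raw SPEAKER_XX ID for unmapped speakers and skipping a reload when the file is mid-write are assumptions.

```python
import json
import os
from typing import Dict, Optional


class SpeakerMap:
    """Hypothetical helper: reload speakers.json only when its mtime changes."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.names: Dict[str, str] = {}
        self._mtime: Optional[float] = None

    def reload_if_changed(self) -> bool:
        """Return True if the mapping was reloaded from disk."""
        try:
            mtime = os.path.getmtime(self.path)
        except OSError:
            return False                    # file missing: keep the current mapping
        if mtime == self._mtime:
            return False                    # unchanged since the last poll
        try:
            with open(self.path, encoding="utf-8") as handle:
                names = json.load(handle)
        except (OSError, json.JSONDecodeError):
            return False                    # possibly mid-write: retry on next poll
        self.names, self._mtime = names, mtime
        return True

    def resolve(self, speaker_id: str) -> str:
        """Map SPEAKER_XX to a name; unknown IDs fall back to the raw ID."""
        return self.names.get(speaker_id, speaker_id)
```

A background loop would call `reload_if_changed()` every 5 seconds. Since the mapping is shared with the audio thread, access would need the same lock that protects the rest of the bridge state.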
bridge/requirements.txt

paho-mqtt>=2.0
websockets>=12.0
sounddevice>=0.4.6
numpy>=1.24
fastapi>=0.111
uvicorn>=0.29
python-multipart>=0.0.9
miniaudio>=1.59
imageio-ffmpeg>=2.9
Browser-based, scales to any screen size.
┌─────────────────────────────────────────────────┐
│ PASTOR JOHN │ ← speaker name, prominent, top section
│─────────────────────────────────────────────────│
│ ...and He said unto them, go into all the │ ← line 1
│ world and preach the gospel to every │ ← line 2
│ creature. He that believeth and is baptised │ ← line 3
└─────────────────────────────────────────────────┘
- --diarization-backend diart in whisper_launcher.py
- SPEAKER_XX IDs appear automatically in the admin web table within 5s (via speakers.json polling)
- SpeakerEmbedding pipeline
- SPEAKER_N embedding to stored profiles
- bridge/profiles/<name>.npy
- /display: built; fullscreen browser page in admin.py
- loop.call_soon_threadsafe to asyncio queues
- http://localhost:8000
- SPEAKER_00 / SPEAKER_01 labels in WS output
- Run bridge.py, verify MQTT payloads via mosquitto_sub -t display/#
- Open http://localhost:8001, verify speaker rows appear, rename one, confirm bridge picks up the change within 5s
- Open http://[PC-IP]:8001/display on tablet, verify text updates in real time
- /display fullscreen browser page
- (/api/display/stream)
- SPEAKER_XX rows for segments under ~2s (congregation responses)