# CLAUDE.md — AI Development Context

This file provides context for AI-assisted development sessions on the Church Live Transcription Display project.

---

## Project Summary

A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and serves a fullscreen display page over the local WiFi network. Any tablet, TV, or browser-capable device on the same network can act as the display.

No cloud services. No internet required during operation.

---

## Architecture

```text
[Audio source]
      ↓ (USB mic or mixer line-in)
[Windows PC]
  ├── WhisperLiveKit (port 8000)
  │     ├── Whisper large-v3 (transcription)
  │     └── diart / pyannote (real-time speaker diarization)
  │     WebSocket output: ws://localhost:8000/asr
  │
  ├── Mosquitto MQTT broker (port 1883, internal bus)
  │
  ├── bridge.py
  │     ├── Subscribes to Whisper WebSocket
  │     ├── Receives: {text, speaker_id, is_final, ...}
  │     ├── Resolves speaker_id → name via speakers.json
  │     ├── Buffers text to sentence boundary
  │     └── Publishes JSON payload to MQTT topic display/text
  │
  └── admin.py (port 8001)
        ├── Speaker name management (REST API + web UI)
        ├── Per-speaker voice sample library
        ├── Test recording playback
        ├── Subscribes to MQTT display/text
        └── /display — fullscreen display page (SSE push to browsers)
      ↓ WiFi (browser on local network)
[Any tablet / TV / browser-capable device]
  └── http://[PC-IP]:8001/display  ← fullscreen display page
```

Multiple display devices can connect simultaneously.
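
The `bridge.py` steps in the diagram (receive a message, resolve the speaker, buffer to a sentence boundary) can be sketched roughly as follows. This is an illustration only: `resolve_speaker`, `SentenceBuffer`, and `handle_message` are hypothetical names, the message is assumed to be a plain dict with the fields shown in the diagram, and the real bridge may organise this differently.

```python
import time

# All names here are illustrative; they are not taken from bridge.py.
SENTENCE_ENDINGS = (".", "!", "?")


def resolve_speaker(speaker_id, names):
    """Map an anonymous ID such as 'SPEAKER_00' to a name, falling back to the ID."""
    return names.get(speaker_id, speaker_id)


class SentenceBuffer:
    """Hold finalised text until it ends a sentence or the timeout elapses."""

    def __init__(self, timeout=4.0, clock=time.monotonic):
        self.timeout = timeout      # assumed to mirror SENTENCE_TIMEOUT
        self.clock = clock          # injectable so the timeout can be tested
        self.parts = []
        self.started_at = None

    def push(self, text):
        """Add a fragment; return the buffered sentence when ready, else None."""
        text = text.strip()
        if not text:
            return None
        if self.started_at is None:
            self.started_at = self.clock()
        self.parts.append(text)
        joined = " ".join(self.parts)
        timed_out = self.clock() - self.started_at >= self.timeout
        if joined.endswith(SENTENCE_ENDINGS) or timed_out:
            self.parts, self.started_at = [], None
            return joined
        return None


def handle_message(msg, names, buffer):
    """Turn one {text, speaker_id, is_final} message into (name, sentence) or None."""
    if not msg.get("is_final"):
        return None
    sentence = buffer.push(msg.get("text", ""))
    if sentence is None:
        return None
    return resolve_speaker(msg.get("speaker_id"), names), sentence
```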

---

## PC Environment

- OS: Windows 10/11
- GPU: NVIDIA RTX 5060 Ti 16 GB (production); RTX 4070 Super also tested
- Python: 3.12 (required — PyTorch wheels not published for 3.13+ yet)
- MQTT broker: Mosquitto (localhost:1883)
- Diarization: diart (pyannote.audio streaming) — requires HuggingFace token and accepted licence for `pyannote/speaker-diarization-3.1` and `pyannote/segmentation-3.0`

### RTX 5060 Ti / CUDA 13 compatibility

ctranslate2 must be pinned to `==4.5.0`. Install the following pip packages explicitly (the CUDA runtime is bundled this way, no system CUDA Toolkit required for ctranslate2 itself):

```text
nvidia-cublas-cu12
nvidia-cudnn-cu12
nvidia-cuda-runtime-cu12
```

`setuptools` must be `<82` to avoid a `pkg_resources` import error at startup.

Confirmed working on this machine: CUDA 13.2, CUDA devices: 1.

### CUDA Notes

- CUDA Toolkit 13.2 is installed on the production PC (`nvcc --version` confirms; `nvidia-smi` shows driver 595.79)
- PyTorch installed from the CUDA 12.4 index (`--index-url https://download.pytorch.org/whl/cu124`) — PyTorch cu124 wheels are forward-compatible with CUDA 13.x drivers
- Without CUDA, WhisperLiveKit falls back to CPU; large-v3 on CPU is ~15× slower than real-time — not viable for live services
- **Triton kernel warning** — at startup you will see `Failed to launch Triton kernels, likely due to missing CUDA toolkit`. This is **misleading** — Triton (the Python package) does not support Windows at all. The fallback to a median kernel is expected and harmless. Under the LocalAgreement backend (current), these timing kernels are not used anyway.

---

## Display Environment

The display is a fullscreen browser page served by `admin.py` at `/display`.

- **URL**: `http://[PC-IP]:8001/display` — open on any tablet, TV, or spare device on the same WiFi
- **Push mechanism**: Server-Sent Events (SSE) from `admin.py` — `admin.py` subscribes to MQTT and forwards display/text payloads to connected browsers via SSE
- **Layout**: Speaker name header at top, 3 rolling lines of transcription text below; font scales to screen size
- **Full-screen**: press F11 in browser (or use guided kiosk mode on tablet)
- **Multiple simultaneous displays**: each browser is an independent SSE subscriber

No microcontroller, firmware, or hardware assembly is required.

---

## MQTT Topics

| Topic | Direction | Payload |
| --- | --- | --- |
| `display/text` | bridge → admin.py → display browsers | JSON: see schema below |
| `display/clear` | bridge → admin.py → display browsers | Empty |

### display/text Payload Schema

```json
{
  "lines": [
    "PASTOR JOHN",
    "...and He said unto them, go",
    "into all the world and preach"
  ]
}
```

- `lines`: array of strings, max `DISPLAY_LINES` items (currently 3); speaker name injected as first line on speaker change
- Bridge pre-wraps text at `MAX_LINE_CHARS` (60) using `textwrap.wrap`; publishes one MQTT message per wrapped line so the display scrolls one line at a time

---

## Key Files

### `bridge/bridge.py`

Main audio pipeline. Headless — no UI. Uses `AudioProcessor` + `TranscriptionEngine` directly (no WebSocket), publishes to Mosquitto.
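
As a rough illustration of the pre-wrapping described under the payload schema, the sketch below wraps a sentence with `textwrap.wrap` and builds one `display/text` payload per wrapped line. It assumes that each message carries the current rolling window of up to `DISPLAY_LINES` lines; the source does not spell this out, so treat that as a guess. `build_payloads` is a hypothetical helper, not a function from `bridge.py`.

```python
import json
import textwrap
from collections import deque

MAX_LINE_CHARS = 60   # value given in the schema notes
DISPLAY_LINES = 3


def build_payloads(text, window, speaker_label=None):
    """Return one JSON payload per new display line.

    `window` is a deque(maxlen=DISPLAY_LINES) holding the lines currently on
    screen (an assumption about how the rolling display is modelled).
    """
    new_lines = textwrap.wrap(text, MAX_LINE_CHARS)
    if speaker_label is not None:
        # On a speaker change the name goes in ahead of the text.
        new_lines.insert(0, speaker_label)
    payloads = []
    for line in new_lines:
        window.append(line)                       # oldest line rolls off
        payloads.append(json.dumps({"lines": list(window)}))
    return payloads
```

Each returned string could then be handed to the MQTT client one at a time, for example with paho's `client.publish("display/text", payload)`, so that the display advances by a single line per message.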

**Current state:**

- `BridgeState` class holds all mutable state (thread-safe via `threading.Lock`)
- `speaker_names`: dict loaded from `speakers.json`, polled for changes every 5s via `_speaker_reloader()`
- `push_final()`: accumulates text, detects speaker change, flushes on sentence boundary or timeout
- `_flush()`: word-wraps with `textwrap.wrap(text, 60)`, publishes **one MQTT message per line** (so display scrolls one line at a time), injects `[SPEAKER NAME]` label on speaker change
- `_receive_results()`: delta-tracks full concatenated transcript across `FrontData.lines` to avoid double-counting the growing last segment
- `_choose_audio_device()`: lists input devices, respects `AUDIO_DEVICE` config constant
- Audio path: `sounddevice.InputStream` → asyncio queue → `audio_processor.process_audio()`
- Inject API (port 8002): `POST /inject` accepts raw PCM bytes from admin.py test playback

**Config constants** (top of file):

- `MQTT_HOST`, `SAMPLE_RATE=16000`, `BLOCKSIZE=4096`
- `SENTENCE_TIMEOUT=4.0`, `MAX_LINE_CHARS=60`, `DISPLAY_LINES=3`
- `AUDIO_DEVICE=12` — Logitech BRIO; set to `None` to use Windows default

**TranscriptionEngine settings:**

- `backend_policy="localagreement"` — WhisperStreaming local-agreement algorithm; more accurate than SimulStreaming, ~2s additional latency
- `confidence_validation=True` — suppresses low-confidence tokens (reduces hallucinations on breath/pause)
- Underlying faster-whisper uses `beam_size=5` (hardcoded in `FasterWhisperASR`)

### `bridge/admin.py`

FastAPI web server on port 8001. Single-file — HTML/CSS/JS embedded as a Python string.
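
This server also forwards MQTT payloads to browsers over SSE, handing messages from the MQTT client thread to per-browser asyncio queues with `loop.call_soon_threadsafe`. A minimal sketch of that hand-off, using only the standard library, is shown here. `Broadcaster` and `format_sse` are hypothetical names, and the actual code in `admin.py` may be structured differently.

```python
import asyncio
import json
import threading


def format_sse(event, data):
    """Encode one Server-Sent Events frame (data must not contain newlines)."""
    return f"event: {event}\ndata: {data}\n\n"


class Broadcaster:
    """Fan one incoming message out to every connected browser's queue."""

    def __init__(self, loop):
        self.loop = loop
        self.queues = set()     # only touched from the event-loop thread

    def subscribe(self):
        """Register a new browser connection and return its queue."""
        queue = asyncio.Queue()
        self.queues.add(queue)
        return queue

    def unsubscribe(self, queue):
        self.queues.discard(queue)

    def _fan_out(self, frame):
        # Runs inside the event loop, so iterating the set is safe.
        for queue in self.queues:
            queue.put_nowait(frame)

    def publish_from_thread(self, event, data):
        """Call from a non-asyncio thread, e.g. an MQTT message callback."""
        self.loop.call_soon_threadsafe(self._fan_out, format_sse(event, data))
```

In this arrangement each SSE request handler would call `subscribe()`, stream frames from its queue until the browser disconnects, then call `unsubscribe()`.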

**Endpoints:**

- `GET /` — speaker management web UI
- `GET|POST /api/speakers` — list / add speakers
- `PUT|DELETE /api/speakers/{sid}` — rename / remove speaker
- `POST|GET /api/speakers/{sid}/recording` — upload / serve per-speaker voice sample
- `POST /api/test/upload` — upload full-service test recording
- `GET /api/test/files` — list test recordings
- `DELETE /api/test/files/{filename}` — delete test recording
- `POST /api/test/start` — stream test recording to WhisperLiveKit (via `_stream_file()`)
- `POST /api/test/stop` — cancel active playback
- `GET /api/test/status` — playback progress / state
- `GET /display` — fullscreen display page (black background, Georgia serif, 3 rolling lines, speaker header in gold)
- `GET /api/display/stream` — SSE endpoint; subscribes to MQTT via paho, pushes `event: text` / `event: clear` to all connected browsers

**Test playback**: `_stream_file()` is an asyncio task that reads audio via `miniaudio.stream_file()` (handles WAV/MP3/FLAC/OGG/M4A, resamples to 16kHz mono) and POSTs raw PCM chunks to `http://127.0.0.1:8002/inject` on bridge.py, which queues them ahead of live microphone input.

### `bridge/whisper_launcher.py`

Startup wrapper for WhisperLiveKit. Applies ffmpeg PATH fix and torchaudio shim, then calls `whisperlivekit.cli.main()`. Used by `start.bat` instead of `wlk` directly.

### `bridge/speakers.json`

Auto-created on first run. Format: `{"SPEAKER_00": "Pastor", "SPEAKER_01": "Reader", ...}`. Seeded with 4 defaults. Persists across sessions. Written by both `bridge.py` and `admin.py`; `bridge.py` polls mtime every 5s to pick up admin changes.

### `bridge/requirements.txt`

```text
paho-mqtt>=2.0
websockets>=12.0
sounddevice>=0.4.6
numpy>=1.24
fastapi>=0.111
uvicorn>=0.29
python-multipart>=0.0.9
miniaudio>=1.59
imageio-ffmpeg>=2.9
```

---

## Display Layout

Browser-based, scales to any screen size.

```text
┌─────────────────────────────────────────────────┐
│  PASTOR JOHN                                    │  ← speaker name, prominent, top section
│─────────────────────────────────────────────────│
│  ...and He said unto them, go into all the      │  ← line 1
│  world and preach the gospel to every           │  ← line 2
│  creature. He that believeth and is baptised    │  ← line 3
└─────────────────────────────────────────────────┘
```

- Font size scales with viewport width — target readability at 3–5 metres
- High contrast: white text on dark background recommended for bright church environments
- Speaker name shown only when speaker changes (not repeated per line)
- Instant update via SSE — no refresh flash

---

## Speaker Diarization Notes

### Active approach — diart (pyannote.audio)

- Launched via `--diarization-backend diart` in `whisper_launcher.py`
- torchaudio compatibility shim applied in launcher (set_audio_backend removed in 2.x)
- Tracks 2–4+ speakers reliably in clean audio conditions
- Works best with direct mixer feed; background music may confuse diarization
- Congregation responses ("Amen", "Hallelujah") appear as brief unknown speakers — minimum-duration filter (~2s) before triggering admin alert is a future improvement

### v1 — Operator-Assisted Naming (current)

- New `SPEAKER_XX` IDs appear automatically in the admin web table within 5s (via speakers.json polling)
- Operator types the name in the table row and saves — takes effect immediately
- Speaker names persist across sessions in speakers.json

### v2 — Voice Enrolment (planned)

- Upload 10–30s clear speech sample per speaker via admin page Voice Sample column (already implemented)
- Extract embedding using pyannote `SpeakerEmbedding` pipeline
- At runtime, compare incoming `SPEAKER_N` embedding to stored profiles
- Auto-assign name if cosine similarity > threshold (~0.85); fall back to operator prompt otherwise
- Embeddings stored in `bridge/profiles/.npy`

---

## Design Constraints & Open Questions

- [x] Display page `/display` — built;
  fullscreen browser page in admin.py
- [x] SSE push from admin.py to display browsers — implemented; paho MQTT subscriber in admin.py, `loop.call_soon_threadsafe` to asyncio queues
- [x] CUDA Toolkit — installed (13.2); GPU acceleration confirmed working
- [ ] Minimum speaker segment duration before adding to admin table (avoid congregation one-liners populating 50 rows)
- [ ] Voice enrolment v2 — pyannote.audio is installed, extraction pipeline not yet written
- [ ] Word-wrap edge cases: long proper nouns, scripture references
- [ ] Session save/restore: if PC crashes mid-service, speakers.json persists so names reload immediately on restart
- [ ] Audio routing on Windows: ensure Whisper receives the mixer/mic channel, not system audio

---

## Testing Approach

1. **Whisper standalone**: speak into mic, verify text in browser at `http://localhost:8000`
2. **Diarization**: two people alternate speaking, verify `SPEAKER_00` / `SPEAKER_01` labels in WS output
3. **Bridge**: run `bridge.py`, verify MQTT payloads via `mosquitto_sub -t display/#`
4. **Admin**: open `http://localhost:8001`, verify speaker rows appear, rename one, confirm bridge picks up the change within 5s
5. **Test playback**: upload a full service recording via admin, press Play at 4×, verify transcription appears in MQTT and display
6. **Display page**: open `http://[PC-IP]:8001/display` on tablet, verify text updates in real time
7. **In-situ trial**: 1–2 Sunday services with a volunteer congregant providing feedback

---

## Development Sequence (Remaining)

1. ~~Build `/display` fullscreen browser page~~ — done
2. ~~SSE push (`/api/display/stream`)~~ — done
3. ~~CUDA Toolkit installation~~ — done (13.2)
4. Voice enrolment v2 — extract pyannote embeddings from uploaded samples, add matching logic to bridge
5. Church deployment trial

---

## Further Enhancements

- Convert speakers.json to remote database for multi-event / multi-location usage
- Transcription log: table of speaker name + first sentence of each turn, exportable after the service
- Minimum-duration filter: suppress `SPEAKER_XX` rows for segments under ~2s (congregation responses)
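
The minimum-duration filter could look roughly like the sketch below. It assumes the diarization output exposes a start and end time in seconds for each segment, which the source does not confirm, and `SpeakerGate` is a hypothetical name. A speaker ID is only surfaced once it produces a single segment at least as long as the threshold.

```python
MIN_SEGMENT_SECONDS = 2.0   # the ~2s figure from the notes; tune against real services


class SpeakerGate:
    """Admit a speaker ID to the admin table only after one long-enough segment."""

    def __init__(self, min_seconds=MIN_SEGMENT_SECONDS):
        self.min_seconds = min_seconds
        self.admitted = set()

    def observe(self, speaker_id, start, end):
        """Record a segment; return True the first time the speaker qualifies."""
        if speaker_id in self.admitted:
            return False                # already in the table, nothing new to add
        if end - start >= self.min_seconds:
            self.admitted.add(speaker_id)
            return True
        return False                    # short interjection, keep it out of the table
```

With this gate, a one-word congregation response never creates a row, while a reader who speaks for several seconds is added on their first full segment.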