# CLAUDE.md: AI Development Context

This file provides context for AI-assisted development sessions on the Church Live Transcription Display project.

---

## Project Summary

A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and sends speaker-tagged rolling text over MQTT to an ESP32 driving a large e-ink display.

No cloud services. No internet required during operation.

---

## Architecture

```
[Audio source]
      ↓ (USB mic or mixer line-in)
[Windows PC]
  ├── WhisperLiveKit
  │     ├── Whisper large-v3 (transcription)
  │     └── Streaming Sortformer (real-time speaker diarization)
  │     WebSocket output: ws://localhost:8000/asr
  │
  ├── Mosquitto MQTT broker (port 1883)
  │
  ├── bridge.py
  │     ├── Subscribes to Whisper WebSocket
  │     ├── Receives: {text, speaker_id, is_final, ...}
  │     ├── Resolves speaker_id → name via speaker_registry
  │     ├── Buffers text to sentence boundary
  │     └── Publishes JSON payload to MQTT topic display/text
  │
  └── admin_ui.py (Tkinter)
        ├── Shows "New speaker detected" prompts
        ├── Operator types name once per unknown speaker
        └── Updates speaker_registry in real time
      ↓ WiFi / MQTT
[ESP32-S3]
  └── Waveshare 7.5" V2 e-ink display (SPI, GxEPD2 library)
```

---

## PC Environment

- OS: Windows 10/11
- GPU: NVIDIA RTX series (RTX 4070 Super available)
- Python: 3.11+
- MQTT broker: Mosquitto (localhost:1883)
- Whisper server: WhisperLiveKit with `--diarization` flag
  - Command: `whisperlivekit-server --model large-v3 --diarization --language en`
  - WebSocket: `ws://localhost:8000/asr`
- Diarization model: Streaming Sortformer (SOTA 2025, via WhisperLiveKit)
  - Fallback: Diart (more stable, slightly older, also integrated in WhisperLiveKit)
  - Requires pyannote model access (HuggingFace token + model agreement)

### WhisperLiveKit Diarization Setup Notes

- Install with diarization extra: `pip install whisperlivekit[diarization-sortformer]`
- Sortformer and Voxtral extras are incompatible; install them in separate environments
- Must accept HuggingFace user conditions for:
  - `pyannote/segmentation`
  - `pyannote/segmentation-3.0`
  - `pyannote/embedding`
- Login: `huggingface-cli login`
- Streaming Sortformer is marked as in active development; fall back to Diart if unstable

---

## ESP32 Environment

- Board: ESP32-S3 (PSRAM required for large font glyph buffers)
- Framework: Arduino via PlatformIO
- Display: Waveshare 7.5" V2 (800×480 pixels, black/white)
- Display library: GxEPD2
- MQTT library: PubSubClient (increase buffer: `client.setBufferSize(512)`)
- Build tool: PlatformIO (VSCode)

### SPI Wiring (Waveshare 7.5" V2 → ESP32)

| Display Pin | ESP32 Pin |
|---|---|
| BUSY | GPIO 4 |
| RST | GPIO 16 |
| DC | GPIO 17 |
| CS | GPIO 5 |
| CLK | GPIO 18 |
| DIN | GPIO 23 |
| GND | GND |
| VCC | 3.3V |

Note: this is the classic ESP32 default SPI mapping. The ESP32-S3 does not expose GPIO 22–25, so DIN (GPIO 23) must be remapped on the S3 board; confirm the remaining pins against the chosen module before wiring.

---

## MQTT Topics

| Topic | Direction | Payload |
|---|---|---|
| `display/text` | PC → ESP32 | JSON: see payload schema below |
| `display/clear` | PC → ESP32 | Empty / any value |
| `display/status` | ESP32 → PC | JSON: `{"ready": true}` |

### display/text Payload Schema

```json
{
  "speaker": "PASTOR JOHN",
  "speaker_changed": true,
  "lines": [
    "...and He said unto them, go",
    "into all the world and preach"
  ]
}
```

- `speaker`: resolved name string, or `null` if unknown/unnamed
- `speaker_changed`: `true` triggers full display refresh + speaker header redraw
- `lines`: array of pre-wrapped strings, max 40 chars each, max 3 items

---

## Key Files

### `bridge/bridge.py`

Main orchestrator. Connects to Whisper WebSocket and Mosquitto. Receives incremental diarized transcription. Buffers text. Resolves speaker names. Publishes MQTT payloads.

**WebSocket message fields from WhisperLiveKit (with diarization):**

```json
{
  "text": "and He said unto them",
  "speaker": "SPEAKER_0",
  "is_final": true,
  "start": 12.4,
  "end": 15.1
}
```

**Bridge logic:**

1. On each `is_final` segment, extract `text` and `speaker`
2. Resolve `speaker` → name via `speaker_registry`
3. If speaker is unknown, notify `admin_ui` (via queue or callback)
4. Accumulate text into rolling buffer
5. On sentence boundary or 4s timeout, word-wrap and publish to MQTT
6. Set `speaker_changed: true` if speaker differs from last published segment

### `bridge/speaker_registry.py`

Manages the session-persistent mapping of `SPEAKER_N` IDs to real names.

```python
# Core interface
registry = SpeakerRegistry()
registry.assign(speaker_id="SPEAKER_0", name="Pastor John")
name = registry.resolve("SPEAKER_0")  # Returns "Pastor John" or None
registry.is_known("SPEAKER_1")        # Returns False
registry.save_session()               # Persist to JSON for the session
```

- Session data stored in `bridge/sessions/YYYY-MM-DD.json`
- v2: will also store voice embeddings per speaker for cross-session recognition

### `bridge/admin_ui.py`

Lightweight Tkinter window. Runs in a separate thread alongside bridge.py.

**Behaviour:**

- Displays current speaker label and resolved name (or "Unknown")
- When a new unknown `SPEAKER_N` is detected, shows a prompt: "New speaker detected. Who is this?"
- Operator types name and hits Enter
- Calls `registry.assign()` and the display updates immediately
- Also shows a manual override: operator can retype any name at any time

### `esp32/src/main.cpp`

ESP32 firmware. WiFi + MQTT client. Receives JSON payloads and renders to e-ink.

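
Before the bridge exists, the firmware can be exercised with hand-built payloads. The following is a minimal Python sketch, not the project's implementation, that wraps text to the 40-character, 3-line limits of the `display/text` schema and prints one JSON payload per screenful. The helper name `build_payloads` and its signature are assumptions made for illustration; only the standard library is used.

```python
import json
import textwrap

MAX_CHARS = 40  # per-line limit from the display/text schema
MAX_LINES = 3   # per-payload limit from the display/text schema


def build_payloads(text, speaker, speaker_changed):
    """Wrap text and split it into schema-conforming payload dicts.

    Hypothetical helper: name and signature are illustrative only.
    """
    # textwrap.wrap never emits a line longer than `width`; by default it
    # breaks over-long words rather than overflowing.
    lines = textwrap.wrap(text, width=MAX_CHARS)
    payloads = []
    for i in range(0, len(lines), MAX_LINES):
        payloads.append({
            "speaker": speaker,  # None serialises to JSON null
            # Only the first payload of a new speaker asks for a full refresh.
            "speaker_changed": bool(speaker_changed and i == 0),
            "lines": lines[i:i + MAX_LINES],
        })
    return payloads


if __name__ == "__main__":
    sample = ("...and He said unto them, go into all the world and preach "
              "the gospel to every creature. He that believeth and is "
              "baptised shall be saved.")
    for payload in build_payloads(sample, "PASTOR JOHN", True):
        print(json.dumps(payload))
```

Each printed line can be published by hand with `mosquitto_pub -t display/text -m` followed by the JSON string. Text longer than three wrapped lines yields follow-up payloads with `speaker_changed` set to `false`, which should take the firmware's partial-refresh path.
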
**Display rendering logic:**

- On `speaker_changed: true`: full refresh, print speaker name in large CAPS on line 1, then print text lines below
- On `speaker_changed: false`: partial refresh, overwrite text lines only (speaker header stays)
- Track partial refresh count; force full refresh every 10 cycles to clear ghosting
- Font: large enough for ~40 chars across 800px (approx FreeSans 18–24pt at this resolution)

---

## Display Layout (800×480 pixels)

```
┌────────────────────────────────────────────────┐ ← full width
│ PASTOR JOHN                                    │ ← speaker name, top ~80px, bold/large
│────────────────────────────────────────────────│
│ ...and He said unto them, go into all the      │ ← text line 1
│ world and preach the gospel to every           │ ← text line 2
│ creature. He that believeth and is baptised    │ ← text line 3
└────────────────────────────────────────────────┘
```

- Speaker name zone: top ~80px
- Text zone: remaining ~380px, 3 lines at ~120px each
- On speaker change: full clear, redraw both zones
- On same speaker new text: partial refresh text zone only

---

## Speaker Diarization Notes

### v1: Operator-Assisted Naming

- Zero prep before service
- admin_ui.py shows prompt when new `SPEAKER_N` appears
- Operator at sound desk types name (e.g. "Pastor John") once
- Registry holds the mapping for the entire session

### v2: Voice Enrolment (future)

- Record 10–30s of each speaker saying natural speech (not word lists)
- Extract speaker embedding using pyannote `SpeakerEmbedding` pipeline
- Store embedding in `bridge/profiles/.npy`
- At runtime, compare incoming `SPEAKER_N` embedding to stored profiles
- If cosine similarity > threshold (~0.85), auto-assign name
- Fall back to operator prompt if no match above threshold

### Known Diarization Constraints

- Streaming Sortformer tracks 2–4+ speakers reliably
- Works best with clean, low-noise audio; direct mixer feed strongly preferred
- Background music (worship) may confuse diarization; consider muting music channel on the transcription input
- Congregation responses ("Amen", "Hallelujah") may appear as brief unknown speakers; consider a minimum-duration filter (~2s) before triggering a speaker prompt

---

## Design Constraints & Open Questions

- [ ] Streaming Sortformer stability in WhisperLiveKit: test early; fall back to Diart if needed
- [ ] Minimum speaker segment duration before triggering name prompt (avoid congregation one-liners)
- [ ] Partial refresh ghosting: determine optimal full-refresh interval for the chosen display
- [ ] ESP32-S3 PSRAM: confirm font glyph buffer fits; WROOM (no PSRAM) likely insufficient for large fonts
- [ ] Word-wrap edge cases: long proper nouns, scripture references, place names
- [ ] Session save/restore: if PC crashes mid-service, can operator reload speaker assignments quickly?
- [ ] Audio routing on Windows: ensure Whisper receives the mixer/mic channel, not system audio

---

## Testing Approach

1. **Whisper standalone**: speak into mic, verify text output in browser at `http://localhost:8000`
2. **Diarization standalone**: two people alternate speaking, verify `SPEAKER_0` / `SPEAKER_1` labels in WS output
3. **Registry + bridge**: run bridge.py, verify name prompts appear in admin_ui.py, verify MQTT payloads via `mosquitto_sub -t display/#`
4. **ESP32 display**: send static MQTT messages manually before connecting bridge
5. **End-to-end**: full pipeline test with recorded sermon audio (mix of 2–3 speakers)
6. **In-situ trial**: 1–2 Sunday services with a volunteer congregant providing feedback

---

## Development Sequence (Suggested)

1. Get WhisperLiveKit running with `--diarization` flag, confirm WS output includes speaker labels
2. Write `bridge.py` (transcription only, no diarization yet) → verify MQTT publish works
3. Add `speaker_registry.py` and `admin_ui.py` → test name mapping loop
4. Integrate diarization into bridge and handle `speaker_changed` logic
5. Write ESP32 firmware: basic text display
6. Add speaker header zone and refresh logic to ESP32 firmware
7. Full end-to-end test on bench
8. Church trial
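
Step 3 of the sequence can start from a small in-memory registry. The sketch below is one possible shape for `bridge/speaker_registry.py`, matching the core interface listed under Key Files. The `session_dir` constructor parameter, the internal lock, and the `load_session()` method are assumptions added here (the last one as a possible answer to the crash-recovery open question) and are not part of the documented interface.

```python
import json
import threading
from datetime import date
from pathlib import Path


class SpeakerRegistry:
    """Maps diarizer IDs such as "SPEAKER_0" to operator-supplied names."""

    def __init__(self, session_dir="bridge/sessions"):
        # Assumed parameter; the default mirrors the documented
        # bridge/sessions/YYYY-MM-DD.json layout.
        self._path = Path(session_dir) / f"{date.today().isoformat()}.json"
        self._names = {}
        # admin_ui runs in its own thread, so guard the shared dict.
        self._lock = threading.Lock()

    def assign(self, speaker_id, name):
        with self._lock:
            self._names[speaker_id] = name

    def resolve(self, speaker_id):
        """Return the assigned name, or None if the speaker is unknown."""
        with self._lock:
            return self._names.get(speaker_id)

    def is_known(self, speaker_id):
        with self._lock:
            return speaker_id in self._names

    def save_session(self):
        """Write the current mapping to today's session file."""
        with self._lock:
            snapshot = dict(self._names)
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._path.write_text(json.dumps(snapshot, indent=2), encoding="utf-8")

    def load_session(self):
        """Assumed addition: restore today's mapping after a restart."""
        if self._path.exists():
            data = json.loads(self._path.read_text(encoding="utf-8"))
            with self._lock:
                self._names = data
```

Calling `save_session()` after every `assign()` keeps the file current at negligible cost, so a restarted bridge could call `load_session()` once at startup and avoid re-prompting the operator.
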