
CLAUDE.md — AI Development Context

This file provides context for AI-assisted development sessions on the Church Live Transcription Display project.


Project Summary

A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and serves a fullscreen display page over the local WiFi network. Any tablet, TV, or browser-capable device on the same network can act as the display. No cloud services. No internet required during operation.


Architecture

[Audio source]
     ↓ (USB mic or mixer line-in)
[Windows PC]
  ├── WhisperLiveKit (port 8000)
  │     ├── Whisper large-v3 (transcription)
  │     └── diart / pyannote (real-time speaker diarization)
  │     WebSocket output: ws://localhost:8000/asr
  │
  ├── Mosquitto MQTT broker (port 1883, internal bus)
  │
  ├── bridge.py
  │     ├── Subscribes to Whisper WebSocket
  │     ├── Receives: {text, speaker_id, is_final, ...}
  │     ├── Resolves speaker_id → name via speakers.json
  │     ├── Buffers text to sentence boundary
  │     └── Publishes JSON payload to MQTT topic display/text
  │
  └── admin.py (port 8001)
        ├── Speaker name management (REST API + web UI)
        ├── Per-speaker voice sample library
        ├── Test recording playback
        ├── Subscribes to MQTT display/text
        └── /display — fullscreen display page (SSE push to browsers)

     ↓ WiFi (browser on local network)
[Any tablet / TV / browser-capable device]
  └── http://[PC-IP]:8001/display   ← fullscreen display page

Multiple display devices can connect simultaneously.


PC Environment

  • OS: Windows 10/11
  • GPU: NVIDIA RTX 5060 Ti 16 GB (production); RTX 4070 Super also tested
  • Python: 3.12 (required — PyTorch wheels not published for 3.13+ yet)
  • MQTT broker: Mosquitto (localhost:1883)
  • Diarization: diart (pyannote.audio streaming) — requires HuggingFace token and accepted licence for pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0

RTX 5060 Ti / CUDA 13 compatibility

ctranslate2 must be pinned to ==4.5.0. Install the following pip packages explicitly; they bundle the CUDA 12 runtime libraries, so ctranslate2 itself does not need a system CUDA Toolkit:

nvidia-cublas-cu12
nvidia-cudnn-cu12
nvidia-cuda-runtime-cu12

setuptools must be <82 to avoid a pkg_resources import error at startup.

Confirmed working on this machine: CUDA 13.2, CUDA devices: 1.
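
The pins above can be checked from inside the virtual environment. The following is a minimal sketch, not part of the project: `installed_version` and `check_environment` are hypothetical helper names, and it assumes the "CUDA devices" figure comes from `ctranslate2.get_cuda_device_count()`.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def check_environment() -> dict:
    """Collect the facts this section cares about without raising."""
    report = {
        "ctranslate2": installed_version("ctranslate2"),
        "setuptools": installed_version("setuptools"),
        "cuda_devices": None,
    }
    # ctranslate2 should be exactly 4.5.0 per the pin above.
    report["ctranslate2_pinned"] = report["ctranslate2"] == "4.5.0"
    # setuptools must be below 82; only the major version is compared.
    if report["setuptools"] is None:
        report["setuptools_ok"] = None
    else:
        report["setuptools_ok"] = int(report["setuptools"].split(".")[0]) < 82
    try:
        import ctranslate2  # imported lazily so the check runs without it

        report["cuda_devices"] = ctranslate2.get_cuda_device_count()
    except Exception:
        pass  # package or CUDA libraries missing: leave the count as None
    return report
```

Printing `check_environment()` after a fresh install gives a quick view of which pin, if any, has drifted.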

CUDA Notes

  • CUDA Toolkit 13.2 is installed on the production PC (nvcc --version confirms; nvidia-smi shows driver 595.79)
  • PyTorch installed from the CUDA 12.4 index (--index-url https://download.pytorch.org/whl/cu124). The cu124 wheels bundle their own CUDA 12.4 runtime and run on the newer 13.x-era driver, because NVIDIA drivers stay backward-compatible with older CUDA runtimes
  • Without CUDA, WhisperLiveKit falls back to CPU; large-v3 on CPU is ~15× slower than real-time — not viable for live services
  • Triton kernel warning: at startup you will see "Failed to launch Triton kernels, likely due to missing CUDA toolkit". The message is misleading: Triton (the Python package) has no official Windows build, so the fallback to a median kernel is expected and harmless. Under the LocalAgreement backend (current), these timing kernels are not used anyway.

Display Environment

The display is a fullscreen browser page served by admin.py at /display.

  • URL: http://[PC-IP]:8001/display — open on any tablet, TV, or spare device on the same WiFi
  • Push mechanism: Server-Sent Events (SSE) from admin.py. admin.py subscribes to MQTT and forwards display/text payloads to connected browsers via SSE
  • Layout: Speaker name header at top, 3 rolling lines of transcription text below; font scales to screen size
  • Full-screen: press F11 in browser (or use guided kiosk mode on tablet)
  • Multiple simultaneous displays: each browser is an independent SSE subscriber
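
The push path described above (an MQTT callback thread feeding one queue per browser) can be sketched as follows. This is an illustration under assumptions, not the code in admin.py: `sse_frame` and `Broadcaster` are hypothetical names, and sending `{}` as the body of a clear event is a guess at a reasonable wire format.

```python
import asyncio
import json


def sse_frame(event: str, data) -> str:
    """Format one Server-Sent Events frame with a named event and a JSON body."""
    return f"event: {event}\ndata: {json.dumps(data)}\n\n"


class Broadcaster:
    """Fans frames out to one asyncio.Queue per connected browser."""

    def __init__(self, loop: asyncio.AbstractEventLoop) -> None:
        self._loop = loop
        self._clients = set()

    def subscribe(self) -> asyncio.Queue:
        """Called on the event loop when a browser opens the stream."""
        queue = asyncio.Queue()
        self._clients.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        """Called on the event loop when a browser disconnects."""
        self._clients.discard(queue)

    def _fan_out(self, frame: str) -> None:
        # Runs on the event loop thread, so touching the set is safe here.
        for queue in self._clients:
            queue.put_nowait(frame)

    def publish_from_thread(self, frame: str) -> None:
        """Safe to call from a non-asyncio thread such as an MQTT callback."""
        self._loop.call_soon_threadsafe(self._fan_out, frame)
```

Each SSE handler would call `subscribe()`, yield frames from its queue until the client disconnects, then call `unsubscribe()`. Scheduling the whole fan-out on the loop keeps the client set single-threaded.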

No microcontroller, firmware, or hardware assembly is required.


MQTT Topics

Topic           Direction                               Payload
display/text    bridge → admin.py → display browsers    JSON: see schema below
display/clear   bridge → admin.py → display browsers    Empty

display/text Payload Schema

{
  "lines": [
    "PASTOR JOHN",
    "...and He said unto them, go",
    "into all the world and preach"
  ]
}
  • lines: array of strings, max DISPLAY_LINES items (currently 3); speaker name injected as first line on speaker change
  • Bridge pre-wraps text at MAX_LINE_CHARS (60) using textwrap.wrap; publishes one MQTT message per wrapped line so the display scrolls one line at a time
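
A minimal sketch of the wrapping and rolling-window behaviour described above. The function name `payloads_for`, the use of a `deque` as the window, and the upper-casing of the speaker name (inferred from the schema example) are assumptions for illustration; the real `_flush()` may differ.

```python
import json
import textwrap
from collections import deque

MAX_LINE_CHARS = 60
DISPLAY_LINES = 3


def payloads_for(text, window, speaker=None):
    """Return one JSON payload per new display line.

    `window` is a deque(maxlen=DISPLAY_LINES) holding what is on screen.
    Pass `speaker` only when the speaker has changed.
    """
    new_lines = textwrap.wrap(text, MAX_LINE_CHARS)
    if speaker:
        new_lines.insert(0, speaker.upper())
    messages = []
    for line in new_lines:
        window.append(line)  # a full deque drops its oldest line
        messages.append(json.dumps({"lines": list(window)}))
    return messages
```

Publishing each returned string as its own display/text message makes the display advance one line per message, which is the scrolling effect the schema notes describe.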

Key Files

bridge/bridge.py

Main audio pipeline. Headless, with no UI. Uses WhisperLiveKit's AudioProcessor and TranscriptionEngine in-process (no WebSocket) and publishes to Mosquitto.

Current state:

  • BridgeState class holds all mutable state (thread-safe via threading.Lock)
  • speaker_names: dict loaded from speakers.json, polled for changes every 5s via _speaker_reloader()
  • push_final(): accumulates text, detects speaker change, flushes on sentence boundary or timeout
  • _flush(): word-wraps with textwrap.wrap(text, 60), publishes one MQTT message per line (so display scrolls one line at a time), injects [SPEAKER NAME] label on speaker change
  • _receive_results(): delta-tracks full concatenated transcript across FrontData.lines to avoid double-counting the growing last segment
  • _choose_audio_device(): lists input devices, respects AUDIO_DEVICE config constant
  • Audio path: sounddevice.InputStream → asyncio queue → audio_processor.process_audio()
  • Inject API (port 8002): POST /inject accepts raw PCM bytes from admin.py test playback
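
The delta-tracking idea behind `_receive_results()` can be illustrated with plain strings. This is a sketch under assumptions: `DeltaTracker` is a hypothetical name, segments are joined with single spaces, and the real code reads its text from FrontData.lines rather than from a list of strings.

```python
import os


class DeltaTracker:
    """Emits only the text that has not been seen in earlier updates."""

    def __init__(self) -> None:
        self._seen = ""

    def feed(self, segments) -> str:
        """Take the current list of segment texts and return the new part."""
        current = " ".join(segments)
        if current.startswith(self._seen):
            delta = current[len(self._seen):]
        else:
            # The engine revised earlier text: emit from the first difference.
            common = os.path.commonprefix([self._seen, current])
            delta = current[len(common):]
        self._seen = current
        return delta
```

Because the last segment keeps growing between updates, comparing against the full concatenated transcript means each word is forwarded once, even though it appears in many successive updates.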

Config constants (top of file):

  • MQTT_HOST, SAMPLE_RATE=16000, BLOCKSIZE=4096
  • SENTENCE_TIMEOUT=4.0, MAX_LINE_CHARS=60, DISPLAY_LINES=3
  • AUDIO_DEVICE=12 — Logitech BRIO; set to None to use Windows default
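
How SENTENCE_TIMEOUT might drive the buffering in `push_final()` is sketched below. The class name `SentenceBuffer`, the punctuation set, and measuring the timeout from the first buffered fragment are all assumptions; the sketch only shows the three flush triggers named earlier (sentence boundary, speaker change, timeout).

```python
import time

SENTENCE_TIMEOUT = 4.0
SENTENCE_END = (".", "!", "?")  # assumed boundary characters


class SentenceBuffer:
    """Accumulates finalised text and decides when to flush it."""

    def __init__(self, timeout=SENTENCE_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._text = ""
        self._speaker = None
        self._started = 0.0

    def _take(self):
        item = (self._speaker, self._text)
        self._text = ""
        return item

    def push(self, speaker, text):
        """Add text; return a list of (speaker, sentence) ready to publish."""
        ready = []
        if self._text and speaker != self._speaker:
            ready.append(self._take())  # speaker changed mid-sentence
        if not self._text:
            self._started = self._clock()
        self._speaker = speaker
        self._text = (self._text + " " + text.strip()).strip()
        if self._text.endswith(SENTENCE_END):
            ready.append(self._take())  # sentence boundary reached
        return ready

    def poll(self):
        """Call periodically; flushes a stalled partial sentence."""
        if self._text and self._clock() - self._started >= self._timeout:
            return [self._take()]
        return []
```

Injecting the clock keeps the timeout testable without sleeping; production code would leave the default `time.monotonic`.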

TranscriptionEngine settings:

  • backend_policy="localagreement" — WhisperStreaming local-agreement algorithm; more accurate than SimulStreaming, ~2s additional latency
  • confidence_validation=True — suppresses low-confidence tokens (reduces hallucinations on breath/pause)
  • Underlying faster-whisper uses beam_size=5 (hardcoded in FasterWhisperASR)

bridge/admin.py

FastAPI web server on port 8001. Single-file — HTML/CSS/JS embedded as a Python string.

Endpoints:

  • GET / — speaker management web UI
  • GET|POST /api/speakers — list / add speakers
  • PUT|DELETE /api/speakers/{sid} — rename / remove speaker
  • POST|GET /api/speakers/{sid}/recording — upload / serve per-speaker voice sample
  • POST /api/test/upload — upload full-service test recording
  • GET /api/test/files — list test recordings
  • DELETE /api/test/files/{filename} — delete test recording
  • POST /api/test/start — stream test recording into the pipeline (via _stream_file(), which posts PCM to the bridge.py inject API)
  • POST /api/test/stop — cancel active playback
  • GET /api/test/status — playback progress / state
  • GET /display — fullscreen display page (black background, Georgia serif, 3 rolling lines, speaker header in gold)
  • GET /api/display/stream — SSE endpoint; subscribes to MQTT via paho, pushes event: text / event: clear to all connected browsers

Test playback: _stream_file() is an asyncio task that reads audio via miniaudio.stream_file() (handles WAV/MP3/FLAC/OGG/M4A, resamples to 16kHz mono) and POSTs raw PCM chunks to http://127.0.0.1:8002/inject on bridge.py, which queues them ahead of live microphone input.
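
The pacing arithmetic for test playback can be sketched without any audio library. The helper names are hypothetical, and the sketch assumes 16-bit mono PCM at 16 kHz; the actual sample format sent to /inject is not stated here.

```python
SAMPLE_RATE = 16000
BYTES_PER_SAMPLE = 2  # assumption: 16-bit signed mono PCM


def pcm_chunks(pcm: bytes, chunk_samples: int = 4096):
    """Yield successive chunks of at most chunk_samples samples."""
    step = chunk_samples * BYTES_PER_SAMPLE
    for offset in range(0, len(pcm), step):
        yield pcm[offset:offset + step]


def chunk_delay(chunk: bytes, speed: float = 1.0) -> float:
    """Seconds to wait after sending a chunk so playback runs at speed x."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return len(chunk) / BYTES_PER_SAMPLE / SAMPLE_RATE / speed
```

An asyncio task would POST each chunk and then `await asyncio.sleep(chunk_delay(chunk, speed))`. At speed 4, one second of audio is sent in a quarter of a second.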

bridge/whisper_launcher.py

Startup wrapper for WhisperLiveKit. Applies ffmpeg PATH fix and torchaudio shim, then calls whisperlivekit.cli.main(). Used by start.bat instead of wlk directly.

bridge/speakers.json

Auto-created on first run. Format: {"SPEAKER_00": "Pastor", "SPEAKER_01": "Reader", ...}. Seeded with 4 defaults. Persists across sessions. Written by both bridge.py and admin.py; bridge.py polls mtime every 5s to pick up admin changes.
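
The mtime polling described above can be sketched as follows. `SpeakerNames` is a hypothetical class, not the project's `_speaker_reloader()`; it assumes that an unreadable or half-written file should be skipped and retried on the next poll.

```python
import json
import os


class SpeakerNames:
    """Maps SPEAKER_XX ids to names, reloading the file when it changes."""

    def __init__(self, path: str) -> None:
        self._path = path
        self._mtime = None
        self.names = {}

    def reload_if_changed(self) -> bool:
        """Reload when the file's mtime differs; return True if reloaded."""
        try:
            mtime = os.stat(self._path).st_mtime_ns
        except FileNotFoundError:
            return False
        if mtime == self._mtime:
            return False
        try:
            with open(self._path, encoding="utf-8") as handle:
                names = json.load(handle)
        except (OSError, json.JSONDecodeError):
            return False  # possibly mid-write: keep the old mapping
        self._mtime = mtime
        self.names = names
        return True

    def resolve(self, speaker_id: str) -> str:
        """Fall back to the raw id until the operator assigns a name."""
        return self.names.get(speaker_id, speaker_id)
```

Calling `reload_if_changed()` from a loop that sleeps 5 seconds reproduces the polling interval mentioned above.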

bridge/requirements.txt

paho-mqtt>=2.0
websockets>=12.0
sounddevice>=0.4.6
numpy>=1.24
fastapi>=0.111
uvicorn>=0.29
python-multipart>=0.0.9
miniaudio>=1.59
imageio-ffmpeg>=2.9

Display Layout

Browser-based, scales to any screen size.

┌─────────────────────────────────────────────────┐
│ PASTOR JOHN                                     │  ← speaker name, prominent, top section
│─────────────────────────────────────────────────│
│ ...and He said unto them, go into all the       │  ← line 1
│ world and preach the gospel to every            │  ← line 2
│ creature. He that believeth and is baptised     │  ← line 3
└─────────────────────────────────────────────────┘
  • Font size scales with viewport width — target readability at 3–5 metres
  • High contrast: white text on dark background recommended for bright church environments
  • Speaker name shown only when speaker changes (not repeated per line)
  • Instant update via SSE — no refresh flash

Speaker Diarization Notes

Active approach — diart (pyannote.audio)

  • Launched via --diarization-backend diart in whisper_launcher.py
  • torchaudio compatibility shim applied in launcher (set_audio_backend removed in 2.x)
  • Tracks 2–4+ speakers reliably in clean audio conditions
  • Works best with direct mixer feed; background music may confuse diarization
  • Congregation responses ("Amen", "Hallelujah") appear as brief unknown speakers; a minimum-duration filter (~2s) applied before triggering an admin alert is a future improvement
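
The minimum-duration filter mentioned in the last point is not implemented yet. One possible shape is sketched here, assuming each diarized segment arrives with start and end times in seconds; `SpeakerGate` and its methods are hypothetical names.

```python
MIN_SEGMENT_SECONDS = 2.0


class SpeakerGate:
    """Admits a speaker id only after one sufficiently long segment."""

    def __init__(self, min_seconds: float = MIN_SEGMENT_SECONDS) -> None:
        self._min = min_seconds
        self._admitted = set()

    def is_known(self, speaker_id: str) -> bool:
        return speaker_id in self._admitted

    def observe(self, speaker_id: str, start: float, end: float) -> bool:
        """Return True the first time a speaker qualifies for the admin table."""
        if speaker_id in self._admitted:
            return False
        if end - start >= self._min:
            self._admitted.add(speaker_id)
            return True
        return False
```

With this gate, a one-word "Amen" never creates a table row, while a reader who speaks for several seconds is admitted on their first long segment.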

v1 — Operator-Assisted Naming (current)

  • New SPEAKER_XX IDs appear automatically in the admin web table within 5s (via speakers.json polling)
  • Operator types the name in the table row and saves — takes effect immediately
  • Speaker names persist across sessions in speakers.json

v2 — Voice Enrolment (planned)

  • Upload 10–30s clear speech sample per speaker via admin page Voice Sample column (already implemented)
  • Extract embedding using pyannote SpeakerEmbedding pipeline
  • At runtime, compare incoming SPEAKER_N embedding to stored profiles
  • Auto-assign name if cosine similarity > threshold (~0.85); fall back to operator prompt otherwise
  • Embeddings stored in bridge/profiles/<name>.npy
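
The planned matching step can be sketched with the standard library alone. This is an illustration of cosine matching against the ~0.85 threshold, not the extraction pipeline (which is not yet written): `match_speaker` is a hypothetical function, and embeddings are shown as plain lists of floats rather than loaded .npy arrays.

```python
import math

MATCH_THRESHOLD = 0.85


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_speaker(embedding, profiles, threshold=MATCH_THRESHOLD):
    """Return the best-matching profile name, or None to prompt the operator."""
    best_name, best_score = None, threshold
    for name, profile in profiles.items():
        score = cosine_similarity(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Returning None rather than the nearest profile keeps the operator-assisted v1 flow as the fallback when no stored voice is close enough.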

Design Constraints & Open Questions

  • Display page /display — built; fullscreen browser page in admin.py
  • SSE push from admin.py to display browsers — implemented; paho MQTT subscriber in admin.py, loop.call_soon_threadsafe to asyncio queues
  • CUDA Toolkit — installed (13.2); GPU acceleration confirmed working
  • Minimum speaker segment duration before adding to admin table (avoid congregation one-liners populating 50 rows)
  • Voice enrolment v2 — pyannote.audio is installed, extraction pipeline not yet written
  • Word-wrap edge cases: long proper nouns, scripture references
  • Session save/restore: if PC crashes mid-service, speakers.json persists so names reload immediately on restart
  • Audio routing on Windows: ensure Whisper receives the mixer/mic channel, not system audio

Testing Approach

  1. Whisper standalone: speak into mic, verify text in browser at http://localhost:8000
  2. Diarization: two people alternate speaking, verify SPEAKER_00 / SPEAKER_01 labels in WS output
  3. Bridge: run bridge.py, verify MQTT payloads via mosquitto_sub -t display/#
  4. Admin: open http://localhost:8001, verify speaker rows appear, rename one, confirm bridge picks up the change within 5s
  5. Test playback: upload a full service recording via admin, press Play at 4×, verify transcription appears in MQTT and display
  6. Display page: open http://[PC-IP]:8001/display on tablet, verify text updates in real time
  7. In-situ trial: 1–2 Sunday services with a volunteer congregant providing feedback

Development Sequence (Remaining)

  1. Build /display fullscreen browser page — done
  2. SSE push (/api/display/stream) — done
  3. CUDA Toolkit installation — done (13.2)
  4. Voice enrolment v2 — extract pyannote embeddings from uploaded samples, add matching logic to bridge
  5. Church deployment trial

Further Enhancements

  • Move speakers.json to a remote database for multi-event / multi-location use
  • Transcription log: table of speaker name + first sentence of each turn, exportable after the service
  • Minimum-duration filter: suppress SPEAKER_XX rows for segments under ~2s (congregation responses)