# CLAUDE.md — AI Development Context

This file provides context for AI-assisted development sessions on the Church Live Transcription Display project.

---

## Project Summary

A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and serves a fullscreen display page over the local WiFi network. Any tablet, TV, or browser-capable device on the same network can act as the display. No cloud services. No internet required during operation.

---

## Architecture

```text
[Audio source]
     ↓ (USB mic or mixer line-in)
[Windows PC]
  ├── WhisperLiveKit (port 8000)
  │     ├── Whisper large-v3 (transcription)
  │     └── diart / pyannote (real-time speaker diarization)
  │     WebSocket output: ws://localhost:8000/asr
  │
  ├── Mosquitto MQTT broker (port 1883, internal bus)
  │
  ├── bridge.py
  │     ├── Subscribes to Whisper WebSocket
  │     ├── Receives: {text, speaker_id, is_final, ...}
  │     ├── Resolves speaker_id → name via speakers.json
  │     ├── Buffers text to sentence boundary
  │     └── Publishes JSON payload to MQTT topic display/text
  │
  └── admin.py (port 8001)
        ├── Speaker name management (REST API + web UI)
        ├── Per-speaker voice sample library
        ├── Test recording playback
        ├── Subscribes to MQTT display/text
        └── /display — fullscreen display page (SSE push to browsers)

     ↓ WiFi (browser on local network)
[Any tablet / TV / browser-capable device]
  └── http://[PC-IP]:8001/display   ← fullscreen display page
```

Multiple display devices can connect simultaneously.

---

## PC Environment

- OS: Windows 10/11
- GPU: NVIDIA RTX series (RTX 4070 Super tested)
- Python: 3.12 (required — PyTorch wheels not published for 3.13+ yet)
- MQTT broker: Mosquitto (localhost:1883)
- Whisper server: WhisperLiveKit launched via `bridge/whisper_launcher.py`
  - Command: `python bridge\whisper_launcher.py --model large-v3 --lan en --diarization-backend diart`
  - WebSocket: `ws://localhost:8000/asr`
- Diarization: diart (pyannote.audio streaming), activated via `--diarization-backend diart`
  - Requires HuggingFace token and accepted licence for `pyannote/speaker-diarization-3.1` and `pyannote/segmentation-3.0`
  - Sortformer backend exists but requires NVIDIA NeMo — not installed; diart is the active backend

**RTX 5060 Ti / CUDA 13 note:**

- Pin `ctranslate2` to 4.5.0
- Install `nvidia-cublas-cu12`, `nvidia-cudnn-cu12`, and `nvidia-cuda-runtime-cu12` explicitly
- Keep `setuptools` below 82 to avoid the `pkg_resources` import error
- Confirmed working with this setup (`CUDA devices: 1`)

### WhisperLiveKit Launch Notes

`bridge/whisper_launcher.py` must be used instead of `wlk` directly. It applies two patches before loading WhisperLiveKit:

1. **ffmpeg PATH** — adds `imageio-ffmpeg` bundled binary to PATH so `whisperlivekit.ffmpeg_manager` can spawn ffmpeg without a system-wide install
2. **torchaudio shim** — injects `torchaudio.set_audio_backend = lambda b: None` before diart is imported; diart calls this function at module load time but it was removed in torchaudio 2.x
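
The two patches can be sketched as follows. This is a minimal illustration, not the contents of `whisper_launcher.py`: the helper names `prepend_to_path` and `install_set_audio_backend_shim` are hypothetical, and stand-in objects are used so the sketch runs without torchaudio or imageio-ffmpeg installed.

```python
import os
import types


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH (only once) and return the new PATH."""
    env = os.environ if env is None else env
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if directory not in parts:
        parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


def install_set_audio_backend_shim(torchaudio_module):
    """Add a no-op set_audio_backend if the module lacks one.

    Returns True when the shim was installed, False when the attribute
    already existed and was left untouched.
    """
    if hasattr(torchaudio_module, "set_audio_backend"):
        return False
    torchaudio_module.set_audio_backend = lambda backend: None
    return True


if __name__ == "__main__":
    # Stand-ins: a fake environment dict and an empty module object.
    fake_env = {"PATH": os.pathsep.join(["C:\\Windows", "C:\\Tools"])}
    print(prepend_to_path("C:\\ffmpeg-bin", fake_env))
    fake_torchaudio = types.ModuleType("torchaudio")
    print(install_set_audio_backend_shim(fake_torchaudio))
```

In a real launcher, both calls would need to run before WhisperLiveKit (and therefore diart) is imported, since the shim only helps if it is in place at diart's module load time.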

### CUDA Notes

- PyTorch must be installed from the CUDA index (`--index-url https://download.pytorch.org/whl/cu124`)
- CUDA Toolkit 12.x must be separately installed from NVIDIA (provides `cublas64_12.dll`)
- Without CUDA, WhisperLiveKit falls back to CPU; large-v3 on CPU is ~15× slower than real-time — not viable for live services
- GPU target: RTX 4070 Super runs large-v3 comfortably in real-time
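
A quick diagnostic for the `cublas64_12.dll` requirement might look like the sketch below. It assumes the DLL is expected to be reachable through a `PATH` entry, which is how the CUDA Toolkit installer typically exposes its `bin` directory; `find_on_path` is a hypothetical helper, not part of the project.

```python
import os
from pathlib import Path


def find_on_path(filename, path_value=None):
    """Return the full path of the first PATH entry containing `filename`, or None."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    for entry in path_value.split(os.pathsep):
        if entry and (Path(entry) / filename).is_file():
            return Path(entry) / filename
    return None


if __name__ == "__main__":
    hit = find_on_path("cublas64_12.dll")
    print(hit if hit else "cublas64_12.dll not found on PATH")
```

A missing DLL here would be one possible explanation for a silent fallback to CPU, though it is not the only one.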

---

## Display Environment

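The display is a fullscreen browser page served by `admin.py` at `/display`. The page and its SSE endpoint are planned but not yet implemented; the details below describe the intended behaviour.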

- **URL**: `http://[PC-IP]:8001/display` — open on any tablet, TV, or spare device on the same WiFi
- **Push mechanism**: Server-Sent Events (SSE); `admin.py` subscribes to MQTT and forwards `display/text` payloads to connected browsers
- **Layout**: Speaker name header at top, 3 rolling lines of transcription text below; font scales to screen size
- **Full-screen**: press F11 in browser (or use guided kiosk mode on tablet)
- **Multiple simultaneous displays**: each browser is an independent SSE subscriber
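
Because the display page is not yet built, the following is only a sketch of how the SSE framing and per-browser fan-out could work. `format_sse` and `DisplayHub` are hypothetical names, and the queue size of 16 is an arbitrary choice for illustration.

```python
import asyncio
import json


def format_sse(payload, event=None):
    """Serialise one payload as a Server-Sent Events frame (ends with a blank line)."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # json.dumps emits a single line, so one "data:" field is enough.
    lines.append(f"data: {json.dumps(payload, ensure_ascii=False)}")
    return "\n".join(lines) + "\n\n"


class DisplayHub:
    """Hypothetical fan-out: one asyncio.Queue per connected browser."""

    def __init__(self):
        self._subscribers = set()

    def subscribe(self):
        queue = asyncio.Queue(maxsize=16)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue):
        self._subscribers.discard(queue)

    def broadcast(self, payload):
        """Queue a frame for every subscriber; return how many received it."""
        frame = format_sse(payload)
        delivered = 0
        for queue in self._subscribers:
            try:
                queue.put_nowait(frame)
                delivered += 1
            except asyncio.QueueFull:
                pass  # slow client: drop this update rather than block the others
        return delivered
```

Each SSE request handler would call `subscribe()`, yield frames from its queue, and call `unsubscribe()` on disconnect. If the MQTT client runs its callbacks in a background thread (as paho does with `loop_start()`), `broadcast` would need to be scheduled onto the event loop with `loop.call_soon_threadsafe` rather than called directly.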

No microcontroller, firmware, or hardware assembly is required.

---

## MQTT Topics

| Topic | Direction | Payload |
| --- | --- | --- |
| `display/text` | bridge → admin.py → display browsers | JSON: see schema below |
| `display/clear` | bridge → admin.py → display browsers | Empty |

### display/text Payload Schema

```json
{
  "lines": [
    "PASTOR JOHN",
    "...and He said unto them, go",
    "into all the world and preach"
  ]
}
```

- `lines`: array of strings, max `DISPLAY_LINES` items (currently 3); speaker name injected as first line on speaker change
- Bridge pre-wraps text at `MAX_LINE_CHARS` (38) using `textwrap.wrap`
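
As an illustration of the schema, a payload could be assembled roughly as below. This is a sketch under assumptions: `build_payload` is a hypothetical helper, and upper-casing the speaker label is inferred from the example above rather than confirmed bridge behaviour.

```python
import textwrap

MAX_LINE_CHARS = 38
DISPLAY_LINES = 3


def build_payload(current_lines, new_text, speaker_label=None):
    """Append wrapped text (and an optional speaker label), keep the newest lines."""
    lines = list(current_lines)
    if speaker_label:
        lines.append(speaker_label.upper())  # assumption: labels shown in capitals
    lines.extend(textwrap.wrap(new_text, MAX_LINE_CHARS))
    return {"lines": lines[-DISPLAY_LINES:]}


if __name__ == "__main__":
    first = build_payload([], "...and He said unto them, go into all the world", "Pastor John")
    print(first)
    print(build_payload(first["lines"], "and preach the gospel"))
```

Note that with a simple "keep the last `DISPLAY_LINES`" rule, a sentence that wraps to three or more lines pushes the speaker label off the display immediately; how `_flush()` handles that case is not specified here.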

---

## Key Files

### `bridge/bridge.py`

Main audio pipeline. Headless — no UI. Connects to Whisper WebSocket and Mosquitto.

**Current state:**

- `BridgeState` class holds all mutable state (thread-safe via `threading.Lock`)
- `speaker_names`: dict loaded from `speakers.json`, polled for changes every 5s via `_speaker_reloader()`
- `push_final()`: accumulates text, detects speaker change, flushes on sentence boundary or timeout
- `_flush()`: word-wraps with `textwrap.wrap(text, 38)`, maintains 3-line rolling display, injects `[SPEAKER NAME]` label on speaker change, publishes to MQTT
- `_choose_audio_device()`: lists input devices, respects `AUDIO_DEVICE` config constant
- Audio path: `sounddevice.InputStream` → asyncio queue → WebSocket chunks to WhisperLiveKit
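
The accumulate-and-flush behaviour described for `push_final()` can be sketched as follows. This is not the bridge's code: `SentenceBuffer` is a hypothetical class, the sentence-ending characters are an assumption, and the clock is injectable only so the logic can be exercised without waiting.

```python
import time

SENTENCE_TIMEOUT = 4.0
SENTENCE_ENDINGS = (".", "!", "?")  # assumption: what counts as a sentence boundary


class SentenceBuffer:
    def __init__(self, timeout=SENTENCE_TIMEOUT, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock
        self._parts = []
        self._speaker = None
        self._started = None

    def push_final(self, text, speaker):
        """Add finalised text; return a list of (speaker, sentence) ready to display."""
        ready = []
        # A speaker change flushes whatever the previous speaker had pending.
        if self._parts and speaker != self._speaker:
            ready.append(self._drain())
        if not self._parts:
            self._started = self._clock()
        self._speaker = speaker
        self._parts.append(text.strip())
        timed_out = self._clock() - self._started >= self._timeout
        if text.rstrip().endswith(SENTENCE_ENDINGS) or timed_out:
            ready.append(self._drain())
        return ready

    def _drain(self):
        sentence = " ".join(p for p in self._parts if p)
        speaker = self._speaker
        self._parts = []
        return speaker, sentence
```

In this sketch the timeout is only checked when new text arrives; a live system would also need a timer so that a trailing fragment is flushed when the speaker simply stops talking.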

**Config constants** (top of file):

- `MQTT_HOST`, `WS_URL`, `SAMPLE_RATE=16000`, `BLOCKSIZE=4096`
- `SENTENCE_TIMEOUT=4.0`, `MAX_LINE_CHARS=38`, `DISPLAY_LINES=3`
- `AUDIO_DEVICE=None` — set to an integer index to force a specific microphone

### `bridge/admin.py`

FastAPI web server on port 8001. Single-file — HTML/CSS/JS embedded as a Python string.

**Endpoints:**

- `GET /` — speaker management web UI
- `GET|POST /api/speakers` — list / add speakers
- `PUT|DELETE /api/speakers/{sid}` — rename / remove speaker
- `POST|GET /api/speakers/{sid}/recording` — upload / serve per-speaker voice sample
- `POST /api/test/upload` — upload full-service test recording
- `GET /api/test/files` — list test recordings
- `DELETE /api/test/files/{filename}` — delete test recording
- `POST /api/test/start` — stream test recording to WhisperLiveKit (via `_stream_file()`)
- `POST /api/test/stop` — cancel active playback
- `GET /api/test/status` — playback progress / state
- `GET /display` — fullscreen display page *(not yet implemented)*
- `GET /api/display/stream` — SSE endpoint for display page *(not yet implemented)*

**Test playback**: `_stream_file()` is an asyncio task that reads audio via `miniaudio.stream_file()` (handles WAV/MP3/FLAC/OGG/M4A, resamples to 16kHz mono) and streams chunks to `ws://localhost:8000/asr`, mimicking live microphone input.
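
To mimic a microphone, a file streamer has to pace its chunks. The sketch below shows the arithmetic for real-time and accelerated playback. It assumes 16-bit mono PCM and reuses the bridge's `SAMPLE_RATE` and `BLOCKSIZE` values for illustration; whether `_stream_file()` uses the same chunk size is not stated, and both function names are hypothetical.

```python
SAMPLE_RATE = 16000
BLOCKSIZE = 4096


def chunk_interval(blocksize=BLOCKSIZE, sample_rate=SAMPLE_RATE, speed=1.0):
    """Seconds to wait between chunks so audio arrives at `speed` times real time."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return blocksize / sample_rate / speed


def iter_pcm_chunks(pcm, blocksize=BLOCKSIZE, bytes_per_sample=2):
    """Split raw PCM bytes into chunks of `blocksize` samples (last one may be short)."""
    step = blocksize * bytes_per_sample
    for offset in range(0, len(pcm), step):
        yield pcm[offset:offset + step]


if __name__ == "__main__":
    print(chunk_interval())           # pacing for real-time playback
    print(chunk_interval(speed=4.0))  # pacing for 4x playback
```

A streaming task would send each chunk over the WebSocket and then `await asyncio.sleep(chunk_interval(speed=...))` before sending the next.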

### `bridge/whisper_launcher.py`

Startup wrapper for WhisperLiveKit. Applies ffmpeg PATH fix and torchaudio shim, then calls `whisperlivekit.cli.main()`. Used by `start.bat` instead of `wlk` directly.

### `bridge/speakers.json`

Auto-created on first run. Format: `{"SPEAKER_00": "Pastor", "SPEAKER_01": "Reader", ...}`. Seeded with 4 defaults. Persists across sessions. Written by both `bridge.py` and `admin.py`; `bridge.py` polls mtime every 5s to pick up admin changes.
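
Since two processes write this file and one polls it, a reader and writer pair might be sketched as below. `SpeakerStore` and `save_speakers` are hypothetical names; the atomic write via `os.replace` is a suggested safeguard, not a description of the current code.

```python
import json
import os
import tempfile
from pathlib import Path


class SpeakerStore:
    def __init__(self, path):
        self.path = Path(path)
        self.names = {}
        self._mtime = None

    def reload_if_changed(self):
        """Reload when the file's mtime differs from the last one seen."""
        try:
            mtime = self.path.stat().st_mtime_ns
        except FileNotFoundError:
            return False
        if mtime == self._mtime:
            return False
        try:
            self.names = json.loads(self.path.read_text(encoding="utf-8"))
        except json.JSONDecodeError:
            return False  # unreadable file: keep the old names, retry on next poll
        self._mtime = mtime
        return True

    def resolve(self, speaker_id):
        """Map SPEAKER_XX to a name, falling back to the raw ID."""
        return self.names.get(speaker_id, speaker_id)


def save_speakers(path, names):
    """Write to a temp file, then os.replace, so readers never see a partial file."""
    path = Path(path)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(names, handle, indent=2)
    os.replace(tmp, path)
```

A poller would call `reload_if_changed()` every 5s. Note that the atomic write does not resolve lost updates when both processes write at nearly the same time; that would need a lock or a single writer.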

### `bridge/requirements.txt`

```text
paho-mqtt>=2.0
websockets>=12.0
sounddevice>=0.4.6
numpy>=1.24
fastapi>=0.111
uvicorn>=0.29
python-multipart>=0.0.9
miniaudio>=1.59
imageio-ffmpeg>=2.9
```

---

## Display Layout

Browser-based, scales to any screen size.

```text
┌─────────────────────────────────────────────────┐
│ PASTOR JOHN                                     │  ← speaker name, prominent, top section
│─────────────────────────────────────────────────│
│ ...and He said unto them, go into all the       │  ← line 1
│ world and preach the gospel to every            │  ← line 2
│ creature. He that believeth and is baptised     │  ← line 3
└─────────────────────────────────────────────────┘
```

- Font size scales with viewport width — target readability at 3–5 metres
- High contrast: white text on dark background recommended for bright church environments
- Speaker name shown only when speaker changes (not repeated per line)
- Instant update via SSE — no refresh flash

---

## Speaker Diarization Notes

### Active approach — diart (pyannote.audio)

- Launched via `--diarization-backend diart` in `whisper_launcher.py`
- torchaudio compatibility shim applied in launcher (set_audio_backend removed in 2.x)
- Tracks 2–4+ speakers reliably in clean audio conditions
- Works best with direct mixer feed; background music may confuse diarization
- Congregation responses ("Amen", "Hallelujah") appear as brief unknown speakers; a minimum-duration filter (~2s) applied before triggering an admin alert is a future improvement
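
One possible shape for the minimum-duration filter is sketched below. It assumes each diarization segment carries start and end times in seconds, which is not confirmed here; `SpeakerGate` and `MIN_SPEAKER_SECONDS` are hypothetical names.

```python
MIN_SPEAKER_SECONDS = 2.0  # hypothetical constant matching the ~2s figure above


class SpeakerGate:
    """Admit a speaker ID only after it produces one sufficiently long segment."""

    def __init__(self, min_seconds=MIN_SPEAKER_SECONDS):
        self.min_seconds = min_seconds
        self.admitted = set()

    def observe(self, speaker_id, start, end):
        """Return True the first time `speaker_id` qualifies, otherwise False."""
        if speaker_id in self.admitted:
            return False
        if end - start >= self.min_seconds:
            self.admitted.add(speaker_id)
            return True
        return False


if __name__ == "__main__":
    gate = SpeakerGate()
    print(gate.observe("SPEAKER_03", 10.0, 10.8))  # short response, not admitted
    print(gate.observe("SPEAKER_01", 30.0, 34.5))  # long enough, admitted
```

The check is per segment rather than cumulative, so repeated short responses from the congregation never add up to an admission.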

### v1 — Operator-Assisted Naming (current)

- New `SPEAKER_XX` IDs appear automatically in the admin web table within 5s (via speakers.json polling)
- Operator types the name in the table row and saves — takes effect immediately
- Speaker names persist across sessions in speakers.json

### v2 — Voice Enrolment (planned)

- Upload a 10–30s clear speech sample per speaker via the admin page's Voice Sample column (upload already implemented)
- Extract embedding using pyannote `SpeakerEmbedding` pipeline
- At runtime, compare incoming `SPEAKER_N` embedding to stored profiles
- Auto-assign name if cosine similarity > threshold (~0.85); fall back to operator prompt otherwise
- Embeddings stored in `bridge/profiles/<name>.npy`
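
The matching step could be sketched as follows. Embedding extraction is out of scope here, so plain lists of floats stand in for the stored `.npy` vectors; `cosine_similarity` and `match_speaker` are hypothetical helpers, and the 0.85 threshold is the approximate figure from the plan above.

```python
import math

SIMILARITY_THRESHOLD = 0.85


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(embedding, profiles, threshold=SIMILARITY_THRESHOLD):
    """Return (name, score) for the best profile above threshold, else (None, score)."""
    best_name, best_score = None, -1.0
    for name, profile in profiles.items():
        score = cosine_similarity(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    if best_name is not None and best_score > threshold:
        return best_name, best_score
    return None, best_score  # caller falls back to the operator prompt


if __name__ == "__main__":
    profiles = {"Pastor": [1.0, 0.0, 0.0], "Reader": [0.0, 1.0, 0.0]}
    print(match_speaker([0.9, 0.1, 0.0], profiles))
    print(match_speaker([0.6, 0.6, 0.0], profiles))
```

The right threshold depends on the embedding model and room acoustics, so ~0.85 should be treated as a starting point to tune against real service recordings.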

---

## Design Constraints & Open Questions

- [ ] Display page `/display` not yet built — next major task
- [ ] SSE push from admin.py to display browsers — requires admin.py to subscribe to MQTT or receive updates from bridge.py via shared state
- [ ] Minimum speaker segment duration before adding to admin table (avoid congregation one-liners populating 50 rows)
- [ ] Voice enrolment v2 — pyannote.audio is installed, extraction pipeline not yet written
- [ ] Word-wrap edge cases: long proper nouns, scripture references
- [ ] Session save/restore: if PC crashes mid-service, speakers.json persists so names reload immediately on restart
- [ ] Audio routing on Windows: ensure Whisper receives the mixer/mic channel, not system audio
- [ ] CUDA Toolkit 12.x installation required for GPU acceleration (cublas64_12.dll)

---

## Testing Approach

1. **Whisper standalone**: speak into mic, verify text in browser at `http://localhost:8000`
2. **Diarization**: two people alternate speaking, verify `SPEAKER_00` / `SPEAKER_01` labels in WS output
3. **Bridge**: run `bridge.py`, verify MQTT payloads via `mosquitto_sub -t display/#`
4. **Admin**: open `http://localhost:8001`, verify speaker rows appear, rename one, confirm bridge picks up the change within 5s
5. **Test playback**: upload a full service recording via admin, press Play at 4×, verify transcription appears in MQTT and display
6. **Display page**: open `http://[PC-IP]:8001/display` on tablet, verify text updates in real time
7. **In-situ trial**: 1–2 Sunday services with a volunteer congregant providing feedback

---

## Development Sequence (Remaining)

1. Build `/display` fullscreen browser page in `admin.py`
2. Add SSE endpoint (`/api/display/stream`) in `admin.py` — subscribe to MQTT, push payloads to browsers
3. Style display page: large font, dark background, speaker header, 3-line rolling text
4. Install CUDA Toolkit 12.x on the production PC to enable GPU acceleration
5. Voice enrolment v2 — extract pyannote embeddings from uploaded samples, add matching logic to bridge
6. Church deployment trial

---

## Further Enhancements

- Move `speakers.json` to a remote database for multi-event / multi-location use
- Transcription log: table of speaker name + first sentence of each turn, exportable after the service
- Minimum-duration filter: suppress `SPEAKER_XX` rows for segments under ~2s (congregation responses)