# CLAUDE.md — AI Development Context

This file provides context for AI-assisted development sessions on the Church Live Transcription Display project.

---

## Project Summary

A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and sends speaker-tagged rolling text over MQTT to an ESP32 driving a large e-ink display. No cloud services. No internet required during operation.

---

## Architecture

```
[Audio source]
     ↓ (USB mic or mixer line-in)
[Windows PC]
  ├── WhisperLiveKit
  │     ├── Whisper large-v3 (transcription)
  │     └── Streaming Sortformer (real-time speaker diarization)
  │     WebSocket output: ws://localhost:8000/asr
  │
  ├── Mosquitto MQTT broker (port 1883)
  │
  ├── bridge.py
  │     ├── Subscribes to Whisper WebSocket
  │     ├── Receives: {text, speaker, is_final, ...}
  │     ├── Resolves speaker → name via speaker_registry
  │     ├── Buffers text to sentence boundary
  │     └── Publishes JSON payload to MQTT topic display/text
  │
  └── admin_ui.py (Tkinter)
        ├── Shows "New speaker detected" prompts
        ├── Operator types name once per unknown speaker
        └── Updates speaker_registry in real time

     ↓ WiFi / MQTT
[ESP32-S3]
  └── Waveshare 7.5" V2 e-ink display (SPI, GxEPD2 library)
```

---

## PC Environment

- OS: Windows 10/11
- GPU: NVIDIA RTX series (RTX 4070 Super available)
- Python: 3.11+
- MQTT broker: Mosquitto (localhost:1883)
- Whisper server: WhisperLiveKit with `--diarization` flag
  - Command: `whisperlivekit-server --model large-v3 --diarization --language en`
  - WebSocket: `ws://localhost:8000/asr`
- Diarization model: Streaming Sortformer (state of the art as of 2025, via WhisperLiveKit)
  - Fallback: Diart (more stable, slightly older, also integrated in WhisperLiveKit)
  - Diart requires pyannote model access (HuggingFace token + model agreement)

### WhisperLiveKit Diarization Setup Notes
- Install with diarization extra: `pip install whisperlivekit[diarization-sortformer]`
- Sortformer and Voxtral extras are incompatible — install in separate environments
- For the Diart fallback, accept the HuggingFace user conditions for:
  - `pyannote/segmentation`
  - `pyannote/segmentation-3.0`
  - `pyannote/embedding`
- Login: `huggingface-cli login`
- Streaming Sortformer is marked as in active development; fall back to Diart if it proves unstable

---

## ESP32 Environment

- Board: ESP32-S3 (PSRAM required for large font glyph buffers)
- Framework: Arduino via PlatformIO
- Display: Waveshare 7.5" V2 (800×480 pixels, black/white)
- Display library: GxEPD2
- MQTT library: PubSubClient (increase buffer: `client.setBufferSize(512)`)
- Build tool: PlatformIO (VSCode)

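### SPI Wiring (Waveshare 7.5" V2 → ESP32)

The pins below are the classic ESP32 VSPI defaults. The ESP32-S3 has no GPIO 22–25, so DIN cannot stay on GPIO 23 on the target board; pick free S3 pins and pass them to `SPI.begin(sck, miso, mosi, ss)` before initialising GxEPD2.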

| Display Pin | ESP32 Pin |
|---|---|
| BUSY | GPIO 4 |
| RST | GPIO 16 |
| DC | GPIO 17 |
| CS | GPIO 5 |
| CLK | GPIO 18 |
| DIN | GPIO 23 |
| GND | GND |
| VCC | 3.3V |

---

## MQTT Topics

| Topic | Direction | Payload |
|---|---|---|
| `display/text` | PC → ESP32 | JSON: see payload schema below |
| `display/clear` | PC → ESP32 | Empty / any value |
| `display/status` | ESP32 → PC | JSON: `{"ready": true}` |

### display/text Payload Schema

```json
{
  "speaker": "PASTOR JOHN",
  "speaker_changed": true,
  "lines": [
    "...and He said unto them, go",
    "into all the world and preach"
  ]
}
```

- `speaker`: resolved name string, or `null` if unknown/unnamed
- `speaker_changed`: `true` triggers full display refresh + speaker header redraw
- `lines`: array of pre-wrapped strings, max 40 chars each, max 3 items
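
The schema constraints above could be enforced on the bridge side with a small helper. This is a minimal sketch, not the project's actual code: `build_payload`, `MAX_CHARS`, and `MAX_LINES` are hypothetical names, and uppercasing the name in the bridge (rather than in the firmware) is an assumption based on the example payload.

```python
import json
import textwrap

MAX_CHARS = 40  # per-line limit from the schema
MAX_LINES = 3   # the display shows three text lines


def build_payload(speaker, text, speaker_changed):
    """Wrap text to the display grid and return the JSON string to publish."""
    # textwrap breaks over-long words by default, so no line exceeds MAX_CHARS
    wrapped = textwrap.wrap(text, width=MAX_CHARS)
    # Keep only the most recent lines so the display rolls forward
    lines = wrapped[-MAX_LINES:]
    return json.dumps({
        # Assumption: names are uppercased here, matching the example payload
        "speaker": speaker.upper() if speaker else None,
        "speaker_changed": bool(speaker_changed),
        "lines": lines,
    })


if __name__ == "__main__":
    print(build_payload("Pastor John",
                        "...and He said unto them, go into all the world and preach",
                        True))
```

The returned string is what would be published to `display/text`. With a 40-character limit and three lines, the payload stays well under the 512-byte PubSubClient buffer unless a speaker name is unusually long.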

---

## Key Files

### `bridge/bridge.py`
Main orchestrator. Connects to Whisper WebSocket and Mosquitto. Receives incremental diarized transcription. Buffers text. Resolves speaker names. Publishes MQTT payloads.

**WebSocket message fields from WhisperLiveKit (with diarization); assumed shape, to be confirmed against the running server:**
```json
{
  "text": "and He said unto them",
  "speaker": "SPEAKER_0",
  "is_final": true,
  "start": 12.4,
  "end": 15.1
}
```

**Bridge logic:**
1. On each `is_final` segment, extract `text` and `speaker`
2. Resolve `speaker` → name via `speaker_registry`
3. If speaker is unknown, notify `admin_ui` (via queue or callback)
4. Accumulate text into rolling buffer
5. On sentence boundary or 4s timeout, word-wrap and publish to MQTT
6. Set `speaker_changed: true` if speaker differs from last published segment
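
Steps 1, 4, 5, and 6 above can be sketched as a small buffer class that is independent of the WebSocket and MQTT clients. Everything here is illustrative: `SegmentBuffer`, `poll`, and the injected `clock` are hypothetical names, the sentence-boundary test is a simple punctuation check, and flushing early when the speaker changes mid-buffer is an added assumption so that two speakers never share one payload.

```python
import re
import time

SENTENCE_END = re.compile(r"[.!?]\s*$")  # naive sentence-boundary check
FLUSH_TIMEOUT_S = 4.0                    # timeout from step 5
_NEVER = object()                        # sentinel: nothing published yet


class SegmentBuffer:
    def __init__(self, clock=time.monotonic):
        self._clock = clock          # injectable so tests can fake time
        self._parts = []
        self._speaker = None
        self._started = None
        self._last_published = _NEVER

    def add(self, message):
        """Feed one WebSocket message (dict); return a list of flushed chunks."""
        out = []
        if not message.get("is_final"):
            return out               # step 1: only final segments count
        speaker = message.get("speaker")
        if self._parts and speaker != self._speaker:
            out.append(self._flush())  # assumption: never mix speakers
        if not self._parts:
            self._speaker = speaker
            self._started = self._clock()
        self._parts.append(message["text"].strip())  # step 4
        if SENTENCE_END.search(self._parts[-1]):
            out.append(self._flush())                # step 5, boundary
        return out

    def poll(self):
        """Call periodically; flushes buffered text older than the timeout."""
        if self._parts and self._clock() - self._started >= FLUSH_TIMEOUT_S:
            return [self._flush()]                   # step 5, timeout
        return []

    def _flush(self):
        changed = self._speaker != self._last_published  # step 6
        self._last_published = self._speaker
        text, self._parts = " ".join(self._parts), []
        return {"speaker_id": self._speaker, "text": text,
                "speaker_changed": changed}
```

Each returned chunk still carries the raw `SPEAKER_N` ID; name resolution (step 2) and the operator notification (step 3) would happen on the chunk before it is wrapped and published.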

### `bridge/speaker_registry.py`
Manages the session-persistent mapping of `SPEAKER_N` IDs to real names.

```python
# Core interface
registry = SpeakerRegistry()
registry.assign(speaker_id="SPEAKER_0", name="Pastor John")
name = registry.resolve("SPEAKER_0")  # Returns "Pastor John" or None
registry.is_known("SPEAKER_1")        # Returns False
registry.save_session()               # Persist to JSON for the session
```

- Session data stored in `bridge/sessions/YYYY-MM-DD.json`
- v2: will also store voice embeddings per speaker for cross-session recognition
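
One way the interface above might be backed is a plain dict persisted to the dated session file. This is a sketch under assumptions: the `session_dir` and `today` constructor parameters are hypothetical (added so the class can be tested), and reloading an existing file on start-up is a guess at how crash recovery could work, not a confirmed design.

```python
import json
from datetime import date
from pathlib import Path


class SpeakerRegistry:
    def __init__(self, session_dir="bridge/sessions", today=None):
        day = today or date.today()
        self.path = Path(session_dir) / f"{day.isoformat()}.json"
        self._names = {}
        # Assumption: reload today's file if present, so a restart
        # mid-service keeps the names already entered
        if self.path.exists():
            self._names = json.loads(self.path.read_text(encoding="utf-8"))

    def assign(self, speaker_id, name):
        self._names[speaker_id] = name.strip()

    def resolve(self, speaker_id):
        """Return the assigned name, or None for an unknown speaker."""
        return self._names.get(speaker_id)

    def is_known(self, speaker_id):
        return speaker_id in self._names

    def save_session(self):
        self.path.parent.mkdir(parents=True, exist_ok=True)
        # Write to a temp file, then swap, so a crash never leaves
        # a half-written session file behind
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self._names, indent=2), encoding="utf-8")
        tmp.replace(self.path)
```

Calling `save_session()` after every `assign()` keeps the file current at negligible cost, since a session rarely holds more than a handful of names.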

### `bridge/admin_ui.py`
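Lightweight Tkinter window. Runs alongside bridge.py in the same process, on its own thread. Tkinter is not thread-safe, so all widget calls must stay on the thread that created the `Tk` root; the bridge hands events over via a queue.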

**Behaviour:**
- Displays current speaker label and resolved name (or "Unknown")
- When a new unknown `SPEAKER_N` is detected, shows a prompt: "New speaker detected. Who is this?"
- Operator types name and hits Enter
- Calls `registry.assign()` and the display updates immediately
- Also offers a manual override: the operator can retype any name at any time
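
The prompt flow above might look like the following. It is a rough sketch, assuming the bridge pushes unknown `SPEAKER_N` IDs onto a `queue.Queue` and that the UI polls it with `after()`. `drain`, `run_admin_ui`, and the 200 ms poll interval are hypothetical, and the manual override is left out for brevity.

```python
import queue


def drain(q):
    """Return every item currently in the queue without blocking."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            return items


def run_admin_ui(prompts, registry):
    """Blocking UI loop; call it from the thread that should own Tk."""
    import tkinter as tk  # imported lazily so drain() works headless

    root = tk.Tk()
    root.title("Speaker names")
    label = tk.Label(root, text="Waiting for speakers...")
    label.pack(padx=20, pady=10)
    entry = tk.Entry(root, width=30)
    entry.pack(padx=20, pady=10)
    pending = []  # unknown speaker IDs still awaiting a name

    def show_next():
        if pending:
            label.config(text=f"New speaker detected ({pending[0]}). Who is this?")
        else:
            label.config(text="Waiting for speakers...")

    def on_enter(_event):
        name = entry.get().strip()
        if pending and name:
            registry.assign(speaker_id=pending.pop(0), name=name)
            entry.delete(0, tk.END)
            show_next()

    def poll():
        pending.extend(drain(prompts))  # pick up IDs queued by the bridge
        show_next()
        root.after(200, poll)           # re-arm the poll timer

    entry.bind("<Return>", on_enter)
    poll()
    root.mainloop()
```

Polling a queue keeps every widget call on the UI thread, which avoids the cross-thread Tk calls that a direct callback from the bridge would risk.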

### `esp32/src/main.cpp`
ESP32 firmware. WiFi + MQTT client. Receives JSON payloads and renders to e-ink.

**Display rendering logic:**
- On `speaker_changed: true`: full refresh, print speaker name in large CAPS on line 1, then print text lines below
- On `speaker_changed: false`: partial refresh, overwrite text lines only (speaker header stays)
- Track partial refresh count; force full refresh every 10 cycles to clear ghosting
- Font: large enough for ~40 chars across 800px (approx FreeSans 18–24pt at this resolution)
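
The refresh rules above reduce to a small state machine. The firmware itself is C++, so the block below is only a host-side Python model that could be used to pin down the behaviour before porting it. `RefreshPolicy` and `next_refresh` are hypothetical names, and reading "every 10 cycles" as "at most 10 consecutive partial refreshes" is an assumption.

```python
FULL_REFRESH_EVERY = 10  # max consecutive partial refreshes (assumed reading)


class RefreshPolicy:
    def __init__(self):
        self.partials_since_full = 0

    def next_refresh(self, speaker_changed):
        """Return "full" or "partial" for the next incoming payload."""
        if speaker_changed or self.partials_since_full >= FULL_REFRESH_EVERY:
            # A speaker change or accumulated ghosting forces a full redraw
            self.partials_since_full = 0
            return "full"
        self.partials_since_full += 1
        return "partial"


if __name__ == "__main__":
    policy = RefreshPolicy()
    print([policy.next_refresh(changed) for changed in [True] + [False] * 12])
```

A full refresh triggered by a speaker change also resets the counter, so a service with frequent speaker changes may never hit the forced refresh at all.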

---

## Display Layout (800×480 pixels)

```
┌────────────────────────────────────────────────┐  ← full width
│ PASTOR JOHN                                    │  ← speaker name, top ~80px, bold/large
│────────────────────────────────────────────────│
│ ...and He said unto them, go into all the      │  ← text line 1
│ world and preach the gospel to every           │  ← text line 2
│ creature. He that believeth and is baptised    │  ← text line 3
└────────────────────────────────────────────────┘
```

- Speaker name zone: top ~80px
- Text zone: remaining ~400px (480 minus the ~80px header), 3 lines at ~120px each plus padding
- On speaker change: full clear, redraw both zones
- On same speaker new text: partial refresh text zone only

---

## Speaker Diarization Notes

### v1 — Operator-Assisted Naming
- Zero prep before service
- admin_ui.py shows prompt when new `SPEAKER_N` appears
- Operator at sound desk types name (e.g. "Pastor John") once
- Registry holds the mapping for the entire session

### v2 — Voice Enrolment (future)
- Record 10–30s of each speaker saying natural speech (not word lists)
- Extract speaker embedding using pyannote `SpeakerEmbedding` pipeline
- Store embedding in `bridge/profiles/<name>.npy`
- At runtime, compare incoming `SPEAKER_N` embedding to stored profiles
- If cosine similarity > threshold (~0.85), auto-assign name
- Fall back to operator prompt if no match above threshold
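
The matching step for v2 could be sketched as below. This is illustrative only: `cosine_similarity` and `match_speaker` are hypothetical helpers, embeddings are shown as plain Python lists to keep the example dependency-free (the stored `.npy` profiles would be NumPy arrays in practice), and the 0.85 threshold is the rough figure from the list above, not a tuned value.

```python
import math

MATCH_THRESHOLD = 0.85  # approximate value from the notes; needs tuning


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(embedding, profiles, threshold=MATCH_THRESHOLD):
    """Return the best-matching enrolled name, or None if below threshold."""
    best_name, best_score = None, threshold
    for name, stored in profiles.items():
        score = cosine_similarity(embedding, stored)
        if score > best_score:
            best_name, best_score = name, score
    return best_name  # None means: fall back to the operator prompt


if __name__ == "__main__":
    profiles = {"Pastor John": [1.0, 0.0, 0.0], "Mary": [0.0, 1.0, 0.0]}
    print(match_speaker([0.9, 0.1, 0.0], profiles))
```

Returning `None` rather than the nearest name keeps the failure mode safe: a wrong name on the display is worse than one more operator prompt.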

### Known Diarization Constraints
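- Streaming Sortformer is built for a small number of concurrent speakers (the NVIDIA streaming model targets up to 4); expect label confusion beyond that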
- Works best with clean, low-noise audio — direct mixer feed strongly preferred
- Background music (worship) may confuse diarization; consider muting music channel on the transcription input
- Congregation responses ("Amen", "Hallelujah") may appear as brief unknown speakers — consider a minimum-duration filter (~2s) before triggering a speaker prompt
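
The minimum-duration filter suggested above might be implemented as a gate in front of the operator prompt. This sketch assumes the `start` and `end` timestamps from the WebSocket message are in seconds and that speech time is accumulated across segments rather than judged per segment; `PromptGate` and `should_prompt` are hypothetical names.

```python
MIN_SPEECH_S = 2.0  # approximate threshold from the notes


class PromptGate:
    def __init__(self, min_speech_s=MIN_SPEECH_S):
        self.min_speech_s = min_speech_s
        self._heard = {}        # speaker_id -> cumulative seconds of speech
        self._prompted = set()  # speaker IDs already sent to the operator

    def should_prompt(self, speaker_id, start, end, is_known):
        """Return True exactly once per unknown speaker, after enough speech."""
        if is_known or speaker_id in self._prompted:
            return False
        total = self._heard.get(speaker_id, 0.0) + max(0.0, end - start)
        self._heard[speaker_id] = total
        if total >= self.min_speech_s:
            self._prompted.add(speaker_id)
            return True
        return False  # too brief so far, e.g. a single "Amen"
```

Accumulating time means a congregant who calls out repeatedly would eventually trigger a prompt; switching to a per-segment check is a one-line change if that turns out to be noisy in practice.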

---

## Design Constraints & Open Questions

- [ ] Streaming Sortformer stability in WhisperLiveKit — test early; fall back to Diart if needed
- [ ] Minimum speaker segment duration before triggering name prompt (avoid congregation one-liners)
- [ ] Partial refresh ghosting — determine optimal full-refresh interval for the chosen display
- [ ] ESP32-S3 PSRAM: confirm font glyph buffer fits; modules without PSRAM are likely insufficient for large fonts (ESP32-S3-WROOM-1 is sold both with and without PSRAM, so check the exact variant)
- [ ] Word-wrap edge cases: long proper nouns, scripture references, place names
- [ ] Session save/restore: if PC crashes mid-service, can operator reload speaker assignments quickly?
- [ ] Audio routing on Windows: ensure Whisper receives the mixer/mic channel, not system audio

---

## Testing Approach

1. **Whisper standalone**: speak into mic, verify text output in browser at `http://localhost:8000`
2. **Diarization standalone**: two people alternate speaking, verify `SPEAKER_0` / `SPEAKER_1` labels in WS output
3. **Registry + bridge**: run bridge.py, verify name prompts appear in admin_ui.py, verify MQTT payloads via `mosquitto_sub -t display/#`
4. **ESP32 display**: send static MQTT messages manually before connecting bridge
5. **End-to-end**: full pipeline test with recorded sermon audio (mix of 2–3 speakers)
6. **In-situ trial**: 1–2 Sunday services with a volunteer congregant providing feedback

---

## Development Sequence (Suggested)

1. Get WhisperLiveKit running with `--diarization` flag, confirm WS output includes speaker labels
2. Write `bridge.py` (transcription only, no diarization yet) → verify MQTT publish works
3. Add `speaker_registry.py` and `admin_ui.py` → test name mapping loop
4. Integrate diarization into bridge — handle `speaker_changed` logic
5. Write ESP32 firmware — basic text display
6. Add speaker header zone and refresh logic to ESP32 firmware
7. Full end-to-end test on bench
8. Church trial