CLAUDE.md — AI Development Context

This file provides context for AI-assisted development sessions on the Church Live Transcription Display project.


Project Summary

A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and sends speaker-tagged rolling text over MQTT to an ESP32 driving a large e-ink display. No cloud services. No internet required during operation.


Architecture

[Audio source]
     ↓ (USB mic or mixer line-in)
[Windows PC]
  ├── WhisperLiveKit
  │     ├── Whisper large-v3 (transcription)
  │     └── Streaming Sortformer (real-time speaker diarization)
  │     WebSocket output: ws://localhost:8000/asr
  │
  ├── Mosquitto MQTT broker (port 1883)
  │
  ├── bridge.py
  │     ├── Connects to Whisper WebSocket
  │     ├── Receives: {text, speaker, is_final, ...}
  │     ├── Resolves speaker → name via speaker_registry
  │     ├── Buffers text to sentence boundary
  │     └── Publishes JSON payload to MQTT topic display/text
  │
  └── admin_ui.py (Tkinter)
        ├── Shows "New speaker detected" prompts
        ├── Operator types name once per unknown speaker
        └── Updates speaker_registry in real time

     ↓ WiFi / MQTT
[ESP32-S3]
  └── Waveshare 7.5" V2 e-ink display (SPI, GxEPD2 library)

PC Environment

  • OS: Windows 10/11
  • GPU: NVIDIA RTX series (RTX 4070 Super available)
  • Python: 3.11+
  • MQTT broker: Mosquitto (localhost:1883)
  • Whisper server: WhisperLiveKit with --diarization flag
    • Command: whisperlivekit-server --model large-v3 --diarization --language en
    • WebSocket: ws://localhost:8000/asr
  • Diarization model: Streaming Sortformer (SOTA 2025, via WhisperLiveKit)
    • Fallback: Diart (more stable, slightly older, also integrated in WhisperLiveKit)
    • The Diart fallback requires pyannote model access (HuggingFace token + model agreement)

WhisperLiveKit Diarization Setup Notes

  • Install with diarization extra: pip install whisperlivekit[diarization-sortformer]
  • Sortformer and Voxtral extras are incompatible — install in separate environments
  • Must accept HuggingFace user conditions for:
    • pyannote/segmentation
    • pyannote/segmentation-3.0
    • pyannote/embedding
  • Login: huggingface-cli login
  • Streaming Sortformer is marked as in active development; fall back to Diart if it proves unstable

ESP32 Environment

  • Board: ESP32-S3 (PSRAM required for large font glyph buffers)
  • Framework: Arduino via PlatformIO
  • Display: Waveshare 7.5" V2 (800×480 pixels, black/white)
  • Display library: GxEPD2
  • MQTT library: PubSubClient (increase buffer: client.setBufferSize(512))
  • Build tool: PlatformIO (VSCode)

SPI Wiring (Waveshare 7.5" V2 → ESP32)

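Display Pin   ESP32 Pin
-----------   ---------
BUSY          GPIO 4
RST           GPIO 16
DC            GPIO 17
CS            GPIO 5
CLK           GPIO 18
DIN           GPIO 23
GND           GND
VCC           3.3V

  • Pin check: CS 5 / CLK 18 / DIN 23 are the classic ESP32 VSPI defaults. The ESP32-S3 does not expose GPIO 22–25, so DIN (GPIO 23) must be remapped to a free S3 pin before wiring.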

MQTT Topics

Topic            Direction    Payload
--------------   ----------   ------------------------------
display/text     PC → ESP32   JSON: see payload schema below
display/clear    PC → ESP32   Empty / any value
display/status   ESP32 → PC   JSON: {"ready": true}

display/text Payload Schema

{
  "speaker": "PASTOR JOHN",
  "speaker_changed": true,
  "lines": [
    "...and He said unto them, go",
    "into all the world and preach"
  ]
}
  • speaker: resolved name string, or null if unknown/unnamed
  • speaker_changed: true triggers full display refresh + speaker header redraw
  • lines: array of pre-wrapped strings, max 40 chars each, max 3 items
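
The schema limits above can be enforced in one place on the PC side. The following is a minimal sketch, not the project's actual code: build_payload, MAX_CHARS and MAX_LINES are hypothetical names, and uppercasing the speaker name in the bridge (rather than in the firmware) is an assumption based on the example payload.

```python
import json
import textwrap

MAX_CHARS = 40   # per-line limit from the schema
MAX_LINES = 3    # maximum number of items in "lines"


def build_payload(text, speaker, speaker_changed):
    """Word-wrap text and return the display/text payload as a JSON string."""
    # textwrap.wrap breaks over-long words by default, so no line exceeds MAX_CHARS.
    wrapped = textwrap.wrap(text, width=MAX_CHARS)
    # Keep only the newest lines so the display always shows the latest words.
    lines = wrapped[-MAX_LINES:]
    payload = {
        # Assumption: the bridge uppercases names; None marks an unknown speaker.
        "speaker": speaker.upper() if speaker else None,
        "speaker_changed": bool(speaker_changed),
        "lines": lines,
    }
    return json.dumps(payload, ensure_ascii=False)
```

Keeping the last three wrapped lines (rather than the first three) is a design choice: when a sentence is too long for the display, the reader sees where the speaker currently is, not where they started.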

Key Files

bridge/bridge.py

Main orchestrator. Connects to Whisper WebSocket and Mosquitto. Receives incremental diarized transcription. Buffers text. Resolves speaker names. Publishes MQTT payloads.

WebSocket message fields from WhisperLiveKit (with diarization):

{
  "text": "and He said unto them",
  "speaker": "SPEAKER_0",
  "is_final": true,
  "start": 12.4,
  "end": 15.1
}
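
A small parsing step keeps the rest of the bridge independent of the wire format. This is a sketch that assumes the message shape shown above; the real WhisperLiveKit output should be checked against a live server before relying on it. Segment and parse_final_segment are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Segment:
    text: str
    speaker: Optional[str]   # e.g. "SPEAKER_0", or None if diarization gave no label
    start: float             # seconds from stream start
    end: float


def parse_final_segment(raw: str) -> Optional[Segment]:
    """Return a Segment for finalised messages, or None for anything else."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Ignore partial hypotheses; only is_final segments are forwarded.
    if not isinstance(msg, dict) or not msg.get("is_final"):
        return None
    text = str(msg.get("text", "")).strip()
    if not text:
        return None
    return Segment(
        text=text,
        speaker=msg.get("speaker"),
        start=float(msg.get("start") or 0.0),
        end=float(msg.get("end") or 0.0),
    )
```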

Bridge logic:

  1. On each is_final segment, extract text and speaker
  2. Resolve speaker → name via speaker_registry
  3. If speaker is unknown, notify admin_ui (via queue or callback)
  4. Accumulate text into rolling buffer
  5. On sentence boundary or 4s timeout, word-wrap and publish to MQTT
  6. Set speaker_changed: true if speaker differs from last published segment
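
Steps 4 to 6 can be sketched as a small buffer with an injectable clock, which makes the 4s timeout testable without waiting. RollingBuffer and its methods are hypothetical names. Flushing pending text when the speaker changes is an added assumption, made because one payload carries only one speaker; treating the very first publish as a speaker change is also an assumption.

```python
import time

SENTENCE_END = (".", "!", "?")
FLUSH_TIMEOUT_S = 4.0
_NEVER = object()   # sentinel: nothing has been published yet


class RollingBuffer:
    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._words = []
        self._speaker = None            # speaker of the buffered text
        self._last_published = _NEVER   # speaker of the last published payload
        self._started = 0.0             # time the first buffered segment arrived

    def add(self, text, speaker):
        """Add a final segment; return a list of (text, speaker, speaker_changed)."""
        out = []
        # Assumption: a speaker change flushes the previous speaker's pending text.
        if self._words and speaker != self._speaker:
            out.append(self._flush())
        if not self._words:
            self._started = self._clock()
        self._speaker = speaker
        self._words.append(text.strip())
        if text.rstrip().endswith(SENTENCE_END):
            out.append(self._flush())
        return out

    def poll(self):
        """Call periodically; flushes buffered text once the timeout has passed."""
        if self._words and self._clock() - self._started >= FLUSH_TIMEOUT_S:
            return self._flush()
        return None

    def _flush(self):
        item = (" ".join(self._words), self._speaker,
                self._speaker != self._last_published)
        self._last_published = self._speaker
        self._words = []
        return item
```

Each returned tuple would then be word-wrapped and published to display/text.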

bridge/speaker_registry.py

Manages the session-persistent mapping of SPEAKER_N IDs to real names.

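# Core interface (import path assumes the code runs from inside bridge/)
from speaker_registry import SpeakerRegistry

registry = SpeakerRegistry()
registry.assign(speaker_id="SPEAKER_0", name="Pastor John")
name = registry.resolve("SPEAKER_0")  # Returns "Pastor John", or None if unassigned
registry.is_known("SPEAKER_1")        # Returns False (no name assigned yet)
registry.save_session()               # Persist the mapping to the session JSON file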
  • Session data stored in bridge/sessions/YYYY-MM-DD.json
  • v2: will also store voice embeddings per speaker for cross-session recognition
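
One possible shape for the class behind that interface is sketched below. It is illustrative only: the constructor arguments, the session_path property and load_session() are hypothetical additions (load_session addresses the crash-recovery question raised under Design Constraints), and the write-then-rename step is a suggested safeguard rather than a stated requirement.

```python
import json
from datetime import date
from pathlib import Path


class SpeakerRegistry:
    def __init__(self, session_dir="bridge/sessions", today=None):
        self._dir = Path(session_dir)
        self._day = today or date.today()
        self._names = {}   # "SPEAKER_N" -> display name

    @property
    def session_path(self):
        # date.isoformat() yields YYYY-MM-DD, matching the documented file name.
        return self._dir / f"{self._day.isoformat()}.json"

    def assign(self, speaker_id, name):
        self._names[speaker_id] = name.strip()

    def resolve(self, speaker_id):
        return self._names.get(speaker_id)

    def is_known(self, speaker_id):
        return speaker_id in self._names

    def save_session(self):
        self._dir.mkdir(parents=True, exist_ok=True)
        # Write to a temp file first, then rename, so a crash mid-write
        # cannot leave a truncated session file behind.
        tmp = self.session_path.with_suffix(".tmp")
        tmp.write_text(json.dumps(self._names, indent=2), encoding="utf-8")
        tmp.replace(self.session_path)

    def load_session(self):
        """Hypothetical helper: reload today's assignments after a restart."""
        if not self.session_path.exists():
            return False
        self._names = json.loads(self.session_path.read_text(encoding="utf-8"))
        return True
```

Calling save_session() after every assign() would keep the file current, so a mid-service restart only needs load_session().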

bridge/admin_ui.py

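Lightweight Tkinter window. Runs in a separate thread alongside bridge.py. Tkinter is not thread-safe, so all widget calls should stay on the UI thread and the bridge should hand events over through a queue.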

Behaviour:

  • Displays current speaker label and resolved name (or "Unknown")
  • When a new unknown SPEAKER_N is detected, shows a prompt: "New speaker detected. Who is this?"
  • Operator types name and hits Enter
  • Calls registry.assign() and the display updates immediately
  • Also shows a manual override: operator can retype any name at any time
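
The hand-off between the bridge and the UI can be kept free of Tkinter code, which also makes it testable without a display. The sketch below assumes a single bridge thread calling notify() and a single UI thread calling next_prompt(); UnknownSpeakerNotifier is a hypothetical name.

```python
import queue


class UnknownSpeakerNotifier:
    """Passes unknown speaker IDs from the bridge thread to the UI thread."""

    def __init__(self):
        self._pending = queue.Queue()   # thread-safe FIFO
        self._announced = set()         # touched only by the bridge thread

    def notify(self, speaker_id):
        """Bridge side: queue an unknown ID once; returns False on repeats."""
        if speaker_id in self._announced:
            return False
        self._announced.add(speaker_id)
        self._pending.put(speaker_id)
        return True

    def next_prompt(self):
        """UI side: return the next ID to ask about, or None if none is waiting."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None
```

On the UI side, a Tkinter window could poll next_prompt() from a callback rescheduled with root.after(...), so no widget is ever touched from the bridge thread.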

esp32/src/main.cpp

ESP32 firmware. WiFi + MQTT client. Receives JSON payloads and renders to e-ink.

Display rendering logic:

  • On speaker_changed: true: full refresh, print speaker name in large CAPS on line 1, then print text lines below
  • On speaker_changed: false: partial refresh, overwrite text lines only (speaker header stays)
  • Track partial refresh count; force full refresh every 10 cycles to clear ghosting
  • Font: large enough for ~40 chars across 800px (approx FreeSans 18–24pt at this resolution)
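
The refresh rules above are modelled here in Python so they can be checked on the bench before being ported; the firmware itself would implement the same decision in C++. RefreshPolicy is a hypothetical name, and "every 10 cycles" is read as: after 10 consecutive partial refreshes, the next update is a full refresh.

```python
FULL_REFRESH_EVERY = 10   # partial refreshes allowed before a forced full refresh


class RefreshPolicy:
    def __init__(self):
        self.partials_since_full = 0

    def next_mode(self, speaker_changed):
        """Return "full" or "partial" for the payload about to be drawn."""
        if speaker_changed or self.partials_since_full >= FULL_REFRESH_EVERY:
            # A full refresh redraws the header and clears accumulated ghosting.
            self.partials_since_full = 0
            return "full"
        self.partials_since_full += 1
        return "partial"
```

Because a speaker change also resets the counter, a lively dialogue rarely hits the forced refresh, while a long monologue hits it regularly.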

Display Layout (800×480 pixels)

┌────────────────────────────────────────────────┐  ← full width
│ PASTOR JOHN                                    │  ← speaker name, top ~80px, bold/large
│────────────────────────────────────────────────│
│ ...and He said unto them, go into all the      │  ← text line 1
│ world and preach the gospel to every           │  ← text line 2
│ creature. He that believeth and is baptised    │  ← text line 3
└────────────────────────────────────────────────┘
  • Speaker name zone: top ~80px
  • Text zone: remaining ~400px (480 − 80), 3 lines at ~120px each with the remainder as margin
  • On speaker change: full clear, redraw both zones
  • On same speaker new text: partial refresh text zone only

Speaker Diarization Notes

v1 — Operator-Assisted Naming

  • Zero prep before service
  • admin_ui.py shows prompt when new SPEAKER_N appears
  • Operator at sound desk types name (e.g. "Pastor John") once
  • Registry holds the mapping for the entire session

v2 — Voice Enrolment (future)

  • Record 10–30s of each speaker saying natural speech (not word lists)
  • Extract speaker embedding using pyannote SpeakerEmbedding pipeline
  • Store embedding in bridge/profiles/<name>.npy
  • At runtime, compare incoming SPEAKER_N embedding to stored profiles
  • If cosine similarity > threshold (~0.85), auto-assign name
  • Fall back to operator prompt if no match above threshold
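
The matching step for v2 can be sketched without any audio library. The code below uses plain Python lists as stand-ins for embeddings; in practice the vectors would come from the stored .npy profiles. match_speaker and cosine_similarity are hypothetical names, and 0.85 is the approximate threshold quoted above, which would need tuning on real recordings.

```python
import math

MATCH_THRESHOLD = 0.85   # approximate starting value; tune on real audio


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(embedding, profiles, threshold=MATCH_THRESHOLD):
    """Return the best-matching enrolled name, or None if nothing beats the threshold."""
    best_name, best_score = None, threshold
    for name, profile in profiles.items():
        score = cosine_similarity(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

A None result corresponds to the last bullet: the bridge falls back to the operator prompt.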

Known Diarization Constraints

  • Streaming Sortformer is built for a small number of speakers (the published streaming model targets up to 4); expect tracking to degrade beyond that
  • Works best with clean, low-noise audio — direct mixer feed strongly preferred
  • Background music (worship) may confuse diarization; consider muting music channel on the transcription input
  • Congregation responses ("Amen", "Hallelujah") may appear as brief unknown speakers — consider a minimum-duration filter (~2s) before triggering a speaker prompt
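
The minimum-duration filter mentioned in the last bullet could look like the sketch below. It accumulates speech time per speaker ID from the start and end fields of each segment and only allows a prompt once the total passes the threshold. PromptGate is a hypothetical name, and counting cumulative rather than single-segment duration is an assumption.

```python
from collections import defaultdict

MIN_SPEECH_S = 2.0   # approximate threshold from the note above


class PromptGate:
    def __init__(self, min_speech_s=MIN_SPEECH_S):
        self._min = min_speech_s
        self._heard = defaultdict(float)   # speaker ID -> seconds of speech so far

    def should_prompt(self, speaker_id, start, end):
        """Add this segment's duration; True once the speaker has talked long enough."""
        self._heard[speaker_id] += max(0.0, end - start)
        return self._heard[speaker_id] >= self._min
```

The trade-off of the cumulative approach is that a congregation member whose short responses keep landing on the same ID will eventually trigger a prompt; a per-segment check would avoid that but would also miss speakers who talk in short bursts.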

Design Constraints & Open Questions

  • Streaming Sortformer stability in WhisperLiveKit — test early; fall back to Diart if needed
  • Minimum speaker segment duration before triggering name prompt (avoid congregation one-liners)
  • Partial refresh ghosting — determine optimal full-refresh interval for the chosen display
  • ESP32-S3 PSRAM: confirm font glyph buffer fits; modules without PSRAM (e.g. classic ESP32-WROOM-32) are likely insufficient for large fonts
  • Word-wrap edge cases: long proper nouns, scripture references, place names
  • Session save/restore: if PC crashes mid-service, can operator reload speaker assignments quickly?
  • Audio routing on Windows: ensure Whisper receives the mixer/mic channel, not system audio

Testing Approach

  1. Whisper standalone: speak into mic, verify text output in browser at http://localhost:8000
  2. Diarization standalone: two people alternate speaking, verify SPEAKER_0 / SPEAKER_1 labels in WS output
  3. Registry + bridge: run bridge.py, verify name prompts appear in admin_ui.py, verify MQTT payloads via mosquitto_sub -t display/#
  4. ESP32 display: send static MQTT messages manually before connecting bridge
  5. End-to-end: full pipeline test with recorded sermon audio (mix of 2–3 speakers)
  6. In-situ trial: 1–2 Sunday services with a volunteer congregant providing feedback

Development Sequence (Suggested)

  1. Get WhisperLiveKit running with --diarization flag, confirm WS output includes speaker labels
  2. Write bridge.py (transcription only, no diarization yet) → verify MQTT publish works
  3. Add speaker_registry.py and admin_ui.py → test name mapping loop
  4. Integrate diarization into bridge — handle speaker_changed logic
  5. Write ESP32 firmware — basic text display
  6. Add speaker header zone and refresh logic to ESP32 firmware
  7. Full end-to-end test on bench
  8. Church trial