CLAUDE.md — AI Development Context

This file provides context for AI-assisted development sessions on the Church Live Transcription Display project.

Project Summary

A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and sends speaker-tagged rolling text over MQTT to an ESP32 driving a large e-ink display. No cloud services. No internet required during operation.

Architecture

[Audio source]
     ↓ (USB mic or mixer line-in)
[Windows PC]
  ├── WhisperLiveKit
  │     ├── Whisper large-v3 (transcription)
  │     └── Streaming Sortformer (real-time speaker diarization)
  │     WebSocket output: ws://localhost:8000/asr
  │
  ├── Mosquitto MQTT broker (port 1883)
  │
  ├── bridge.py
  │     ├── Subscribes to Whisper WebSocket
  │     ├── Receives: {text, speaker_id, is_final, ...}
  │     ├── Resolves speaker_id → name via speaker_registry
  │     ├── Buffers text to sentence boundary
  │     └── Publishes JSON payload to MQTT topic display/text
  │
  └── admin_ui.py (Tkinter)
        ├── Shows "New speaker detected" prompts
        ├── Operator types name once per unknown speaker
        └── Updates speaker_registry in real time

     ↓ WiFi / MQTT
[ESP32-S3]
  └── Waveshare 7.5" V2 e-ink display (SPI, GxEPD2 library)

PC Environment

OS: Windows 10/11
GPU: NVIDIA RTX series (RTX 4070 Super available)
Python: 3.11+
MQTT broker: Mosquitto (localhost:1883)
Whisper server: WhisperLiveKit with --diarization flag
- Command: whisperlivekit-server --model large-v3 --diarization --language en
- WebSocket: ws://localhost:8000/asr
Diarization model: Streaming Sortformer (SOTA 2025, via WhisperLiveKit)
- Fallback: Diart (more stable, slightly older, also integrated in WhisperLiveKit)
- Requires pyannote model access (HuggingFace token + model agreement)

WhisperLiveKit Diarization Setup Notes

Install with diarization extra: pip install whisperlivekit[diarization-sortformer]
Sortformer and Voxtral extras are incompatible — install in separate environments
Must accept HuggingFace user conditions for:
- pyannote/segmentation
- pyannote/segmentation-3.0
- pyannote/embedding
Login: huggingface-cli login
Streaming Sortformer is marked as in active development — fallback to Diart if unstable

ESP32 Environment

Board: ESP32-S3 (PSRAM required for large font glyph buffers)
Framework: Arduino via PlatformIO
Display: Waveshare 7.5" V2 (800×480 pixels, black/white)
Display library: GxEPD2
MQTT library: PubSubClient (increase buffer: client.setBufferSize(512))
Build tool: PlatformIO (VSCode)

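SPI Wiring (Waveshare 7.5" V2 → ESP32)

Note: the pin numbers below are the classic ESP32 default SPI pins. The ESP32-S3 does not expose GPIO 22–25, so DIN (GPIO 23) needs a different pin on the S3 board specified above; confirm the remaining pins against the chosen dev board.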

Display Pin	ESP32 Pin
BUSY	GPIO 4
RST	GPIO 16
DC	GPIO 17
CS	GPIO 5
CLK	GPIO 18
DIN	GPIO 23
GND	GND
VCC	3.3V

MQTT Topics

Topic	Direction	Payload
`display/text`	PC → ESP32	JSON: see payload schema below
`display/clear`	PC → ESP32	Empty / any value
`display/status`	ESP32 → PC	JSON: `{"ready": true}`

display/text Payload Schema

{
  "speaker": "PASTOR JOHN",
  "speaker_changed": true,
  "lines": [
    "...and He said unto them, go",
    "into all the world and preach"
  ]
}

speaker: resolved name string, or null if unknown/unnamed
speaker_changed: true triggers full display refresh + speaker header redraw
lines: array of pre-wrapped strings, max 40 chars each, max 3 items
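
The limits above are easy to enforce on the PC side before anything reaches the broker. The following is a minimal sketch, not the project's actual code: the function name build_display_payload and the two constants are hypothetical, and only the field names and limits come from the schema.

```python
import json

MAX_LINES = 3         # "max 3 items" from the schema
MAX_LINE_CHARS = 40   # "max 40 chars each" from the schema


def build_display_payload(speaker, speaker_changed, lines):
    """Return the display/text payload as a JSON string, or raise ValueError."""
    if speaker is not None and not isinstance(speaker, str):
        raise ValueError("speaker must be a string or None")
    if len(lines) > MAX_LINES:
        raise ValueError(f"too many lines: {len(lines)} > {MAX_LINES}")
    for line in lines:
        if len(line) > MAX_LINE_CHARS:
            raise ValueError(f"line exceeds {MAX_LINE_CHARS} chars: {line!r}")
    payload = {
        "speaker": speaker,                        # None serialises to JSON null
        "speaker_changed": bool(speaker_changed),
        "lines": list(lines),
    }
    return json.dumps(payload)
```

Rejecting oversized input here, rather than truncating silently, keeps word-wrap bugs visible during bench testing.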

Key Files

`bridge/bridge.py`

Main orchestrator. Connects to Whisper WebSocket and Mosquitto. Receives incremental diarized transcription. Buffers text. Resolves speaker names. Publishes MQTT payloads.

WebSocket message fields from WhisperLiveKit (with diarization):

{
  "text": "and He said unto them",
  "speaker": "SPEAKER_0",
  "is_final": true,
  "start": 12.4,
  "end": 15.1
}
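
A defensive parser for these frames might look like the sketch below. It assumes the message shape shown above is what the server actually sends, which should be confirmed against real WebSocket output. Segment and parse_segment are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    text: str
    speaker: Optional[str]
    is_final: bool
    start: float
    end: float

    @property
    def duration(self) -> float:
        """Segment length in seconds, never negative."""
        return max(0.0, self.end - self.start)


def parse_segment(raw: str) -> Optional[Segment]:
    """Parse one WebSocket text frame; return None if it is not usable."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(msg, dict) or not isinstance(msg.get("text"), str):
        return None
    try:
        return Segment(
            text=msg["text"].strip(),
            speaker=msg.get("speaker"),
            is_final=bool(msg.get("is_final", False)),
            start=float(msg.get("start") or 0.0),   # missing/null -> 0.0
            end=float(msg.get("end") or 0.0),
        )
    except (TypeError, ValueError):
        return None
```

Returning None for malformed frames lets the bridge skip them without interrupting the caption stream.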

Bridge logic:

On each is_final segment, extract text and speaker
Resolve speaker → name via speaker_registry
If speaker is unknown, notify admin_ui (via queue or callback)
Accumulate text into rolling buffer
On sentence boundary or 4s timeout, word-wrap and publish to MQTT
Set speaker_changed: true if speaker differs from last published segment
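
The buffering, flush, and speaker_changed steps above can be sketched as follows. This is one possible reading of the logic, with assumptions: RollingBuffer and its methods are hypothetical names, a speaker change flushes the previous speaker's text first, the very first payload counts as a speaker change, and text longer than three lines is split across several payloads rather than dropped.

```python
import re
import textwrap

MAX_LINE_CHARS = 40     # from the display/text schema
MAX_LINES = 3           # from the display/text schema
FLUSH_TIMEOUT_S = 4.0   # "4s timeout" in the bridge logic
# Sentence end: ., ! or ? optionally followed by closing quotes/brackets.
SENTENCE_END = re.compile(r"""[.!?]["')\]]*$""")
_NEVER = object()       # sentinel: nothing has been published yet


class RollingBuffer:
    def __init__(self):
        self._parts = []
        self._speaker = None
        self._started_at = 0.0
        self._last_published = _NEVER

    def add(self, text, speaker, now):
        """Feed one final segment; return a list of payload dicts to publish."""
        text = text.strip()
        if not text:
            return []
        out = []
        if self._parts and speaker != self._speaker:
            out += self._flush()          # never mix two speakers in one payload
        if not self._parts:
            self._speaker, self._started_at = speaker, now
        self._parts.append(text)
        if SENTENCE_END.search(text) or now - self._started_at >= FLUSH_TIMEOUT_S:
            out += self._flush()
        return out

    def tick(self, now):
        """Call periodically so a silent pause still flushes after the timeout."""
        if self._parts and now - self._started_at >= FLUSH_TIMEOUT_S:
            return self._flush()
        return []

    def _flush(self):
        lines = textwrap.wrap(" ".join(self._parts), width=MAX_LINE_CHARS)
        self._parts = []
        payloads = []
        for i in range(0, len(lines), MAX_LINES):
            payloads.append({
                "speaker": self._speaker,
                "speaker_changed": self._speaker != self._last_published,
                "lines": lines[i:i + MAX_LINES],
            })
            self._last_published = self._speaker
        return payloads
```

Passing the clock in as now (for example time.monotonic()) keeps the timeout testable without sleeping. textwrap.wrap breaks words longer than 40 characters by default, which covers the long-proper-noun case crudely.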

`bridge/speaker_registry.py`

Manages the session-persistent mapping of SPEAKER_N IDs to real names.

# Core interface
registry = SpeakerRegistry()
registry.assign(speaker_id="SPEAKER_0", name="Pastor John")
name = registry.resolve("SPEAKER_0")  # Returns "Pastor John" or None
registry.is_known("SPEAKER_1")        # Returns False
registry.save_session()               # Persist to JSON for the session

Session data stored in bridge/sessions/YYYY-MM-DD.json
v2: will also store voice embeddings per speaker for cross-session recognition
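
A minimal sketch of a class that satisfies the interface above. The method names and the sessions path come from this file; the constructor parameters, the reload-on-start behaviour, and the temp-file write are assumptions added for illustration.

```python
import json
from datetime import date
from pathlib import Path


class SpeakerRegistry:
    def __init__(self, session_dir="bridge/sessions", session_date=None):
        day = session_date or date.today()
        self._path = Path(session_dir) / f"{day.isoformat()}.json"
        self._names = {}
        if self._path.exists():
            # Crash recovery: pick up assignments already saved today.
            self._names = json.loads(self._path.read_text(encoding="utf-8"))

    def assign(self, speaker_id, name):
        name = name.strip()
        if not name:
            raise ValueError("name must not be empty")
        self._names[speaker_id] = name

    def resolve(self, speaker_id):
        """Return the assigned name, or None if the ID is unknown."""
        return self._names.get(speaker_id)

    def is_known(self, speaker_id):
        return speaker_id in self._names

    def save_session(self):
        self._path.parent.mkdir(parents=True, exist_ok=True)
        tmp = self._path.with_name(self._path.name + ".tmp")
        tmp.write_text(json.dumps(self._names, indent=2), encoding="utf-8")
        tmp.replace(self._path)   # swap in the complete file in one step
```

Calling save_session() after every assign() would keep the file current, so a restart mid-service reloads the operator's names without re-prompting.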

`bridge/admin_ui.py`

Lightweight Tkinter window. Runs in a separate thread alongside bridge.py.

Behaviour:

Displays current speaker label and resolved name (or "Unknown")
When a new unknown SPEAKER_N is detected, shows a prompt: "New speaker detected. Who is this?"
Operator types name and hits Enter
Calls registry.assign() and the display updates immediately
Also shows a manual override: operator can retype any name at any time
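
Because Tkinter widgets should only be touched from the thread that runs the UI loop, one common pattern is for the bridge to push unknown IDs onto a queue that the UI polls. The sketch below shows only that handoff. SpeakerPrompts and its methods are hypothetical names, and it assumes a single bridge thread calls notify_unknown.

```python
import queue


class SpeakerPrompts:
    """Thread-safe handoff of unknown speaker IDs from the bridge to the UI."""

    def __init__(self):
        self._pending = queue.Queue()
        self._seen = set()   # only touched by the bridge thread

    def notify_unknown(self, speaker_id):
        """Bridge side: enqueue each unknown ID once; return True if enqueued."""
        if speaker_id in self._seen:
            return False
        self._seen.add(speaker_id)
        self._pending.put(speaker_id)
        return True

    def poll(self):
        """UI side: return the next ID needing a name, or None; never blocks."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None
```

On the UI side, poll() could be called from a periodic root.after(...) callback, so the prompt appears without the bridge thread ever calling into Tkinter.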

`esp32/src/main.cpp`

ESP32 firmware. WiFi + MQTT client. Receives JSON payloads and renders to e-ink.

Display rendering logic:

On speaker_changed: true: full refresh, print speaker name in large CAPS on line 1, then print text lines below
On speaker_changed: false: partial refresh, overwrite text lines only (speaker header stays)
Track partial refresh count; force full refresh every 10 cycles to clear ghosting
Font: large enough for ~40 chars across 800px (approx FreeSans 18–24pt at this resolution)
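
The refresh decision is small enough to model off-device before porting it to the firmware. The sketch below is a Python model of the rule, not firmware code. RefreshPolicy is a hypothetical name, and it reads "every 10 cycles" as "after 10 consecutive partial refreshes, the next draw is full", which is an assumption.

```python
FULL_REFRESH_EVERY = 10   # from the rendering notes; tune against real ghosting


class RefreshPolicy:
    def __init__(self, full_every=FULL_REFRESH_EVERY):
        self._full_every = full_every
        self._partials = 0   # partial refreshes since the last full refresh

    def next_mode(self, speaker_changed):
        """Return "full" or "partial" for the payload about to be drawn."""
        if speaker_changed or self._partials >= self._full_every:
            self._partials = 0
            return "full"
        self._partials += 1
        return "partial"
```

A speaker change resets the counter, so a service with frequent speaker changes rarely needs a forced full refresh.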

Display Layout (800×480 pixels)

┌────────────────────────────────────────────────┐  ← full width
│ PASTOR JOHN                                    │  ← speaker name, top ~80px, bold/large
│────────────────────────────────────────────────│
│ ...and He said unto them, go into all the      │  ← text line 1
│ world and preach the gospel to every           │  ← text line 2
│ creature. He that believeth and is baptised    │  ← text line 3
└────────────────────────────────────────────────┘

Speaker name zone: top ~80px
Text zone: remaining ~380px, 3 lines at ~120px each
On speaker change: full clear, redraw both zones
On same speaker new text: partial refresh text zone only
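
The zone arithmetic can be checked with a few lines of Python. This sketch assumes about 20px is reserved for the divider rule and margins, a guess that reconciles the ~380px text zone with 480 − 80 = 400. text_line_tops and its parameters are hypothetical.

```python
def text_line_tops(display_h=480, header_h=80, reserved=20, n_lines=3):
    """Return (line_height, [top_y, ...]) for the text zone, in pixels."""
    zone_h = display_h - header_h - reserved
    if zone_h <= 0 or n_lines <= 0:
        raise ValueError("no room for text lines")
    line_h = zone_h // n_lines            # integer pixels per text line
    first_top = header_h + reserved       # text starts below header + divider
    return line_h, [first_top + i * line_h for i in range(n_lines)]


line_h, tops = text_line_tops()
print(line_h, tops)   # 126 [100, 226, 352]
```

With these defaults each line gets 126px, close to the ~120px figure above, and the last line ends at y = 478, inside the 480px panel.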

Speaker Diarization Notes

v1 — Operator-Assisted Naming

Zero prep before service
admin_ui.py shows prompt when new SPEAKER_N appears
Operator at sound desk types name (e.g. "Pastor John") once
Registry holds the mapping for the entire session

v2 — Voice Enrolment (future)

Record 10–30s of natural speech from each speaker (not word lists)
Extract speaker embedding using pyannote SpeakerEmbedding pipeline
Store embedding in bridge/profiles/<name>.npy
At runtime, compare incoming SPEAKER_N embedding to stored profiles
If cosine similarity > threshold (~0.85), auto-assign name
Fall back to operator prompt if no match above threshold
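
The matching step can be sketched as below. It uses plain Python lists in place of the NumPy arrays that would be loaded from the .npy profiles, to keep the example dependency-free. match_speaker and cosine_similarity are hypothetical helper names, and the 0.85 threshold is the approximate figure from the list above, not a validated value.

```python
import math

MATCH_THRESHOLD = 0.85   # "~0.85" above; needs tuning on real recordings


def cosine_similarity(a, b):
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0   # zero vectors match nothing


def match_speaker(embedding, profiles, threshold=MATCH_THRESHOLD):
    """Return the best profile name above threshold, else None (operator prompt)."""
    best_name, best_score = None, threshold
    for name, profile in profiles.items():
        score = cosine_similarity(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Returning None rather than the nearest profile keeps the v1 operator prompt as the safety net when two enrolled voices are similar.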

Known Diarization Constraints

Streaming Sortformer tracks 2–4 speakers reliably; behaviour beyond four speakers is unverified and should be tested
Works best with clean, low-noise audio — direct mixer feed strongly preferred
Background music (worship) may confuse diarization; consider muting music channel on the transcription input
Congregation responses ("Amen", "Hallelujah") may appear as brief unknown speakers — consider a minimum-duration filter (~2s) before triggering a speaker prompt
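
One way to implement the minimum-duration filter is to accumulate speech time per unknown speaker ID and only allow a prompt once the total passes the threshold. This is a sketch under assumptions: PromptGate is a hypothetical name, the start/end values are taken to be seconds as in the WebSocket message fields, and cumulative time (rather than a single long segment) is a design choice that testing may overturn.

```python
MIN_PROMPT_SECONDS = 2.0   # "~2s" from the note above; tune in testing


class PromptGate:
    """Accumulates speech time per unknown speaker before allowing a prompt."""

    def __init__(self, min_seconds=MIN_PROMPT_SECONDS):
        self._min = min_seconds
        self._spoken = {}   # speaker_id -> total seconds heard so far

    def should_prompt(self, speaker_id, start, end):
        """Record one segment; return True once the speaker has enough speech."""
        total = self._spoken.get(speaker_id, 0.0) + max(0.0, end - start)
        self._spoken[speaker_id] = total
        return total >= self._min
```

With cumulative timing, a diarized ID that only ever says "Amen" will still trigger a prompt after several responses. Requiring one segment of at least two seconds is the stricter alternative.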

Design Constraints & Open Questions

Streaming Sortformer stability in WhisperLiveKit — test early; fall back to Diart if needed
Minimum speaker segment duration before triggering name prompt (avoid congregation one-liners)
Partial refresh ghosting — determine optimal full-refresh interval for the chosen display
ESP32-S3 PSRAM: confirm font glyph buffer fits; WROOM module variants without PSRAM are likely insufficient for large fonts
Word-wrap edge cases: long proper nouns, scripture references, place names
Session save/restore: if PC crashes mid-service, can operator reload speaker assignments quickly?
Audio routing on Windows: ensure Whisper receives the mixer/mic channel, not system audio

Testing Approach

Whisper standalone: speak into mic, verify text output in browser at http://localhost:8000
Diarization standalone: two people alternate speaking, verify SPEAKER_0 / SPEAKER_1 labels in WS output
Registry + bridge: run bridge.py, verify name prompts appear in admin_ui.py, verify MQTT payloads via mosquitto_sub -t display/#
ESP32 display: send static MQTT messages manually before connecting bridge
End-to-end: full pipeline test with recorded sermon audio (mix of 2–3 speakers)
In-situ trial: 1–2 Sunday services with a volunteer congregant providing feedback

Development Sequence (Suggested)

Get WhisperLiveKit running with --diarization flag, confirm WS output includes speaker labels
Write bridge.py (transcription only, no diarization yet) → verify MQTT publish works
Add speaker_registry.py and admin_ui.py → test name mapping loop
Integrate diarization into bridge — handle speaker_changed logic
Write ESP32 firmware — basic text display
Add speaker header zone and refresh logic to ESP32 firmware
Full end-to-end test on bench
Church trial
