
Change to Tablet and ffmpeg updates

Benjamin Harris 1 month ago
parent
commit
bdbd689694

+ 13 - 0
.claude/settings.local.json

@@ -0,0 +1,13 @@
+{
+  "permissions": {
+    "allow": [
+      "Bash(python -c \"import importlib.metadata; ep = [e for e in importlib.metadata.distribution\\('whisperlivekit'\\).entry_points if e.name == 'wlk'][0]; print\\(ep.value\\)\")",
+      "Bash(d:\\\\GIT_REPO\\\\Deaf_Transcription_Service\\\\.venv\\\\Scripts\\\\python.exe -c \"import importlib.metadata; ep = [e for e in importlib.metadata.distribution\\('whisperlivekit'\\).entry_points if e.name == 'wlk'][0]; print\\(ep.value\\)\")",
+      "PowerShell(& \"d:\\\\GIT_REPO\\\\Deaf_Transcription_Service\\\\.venv\\\\Scripts\\\\python.exe\" -c \"import importlib.metadata; ep = [e for e in importlib.metadata.distribution\\('whisperlivekit'\\).entry_points if e.name == 'wlk'][0]; print\\(ep.value\\)\")",
+      "PowerShell(& ffmpeg -version 2>&1)",
+      "PowerShell(& \"d:\\\\GIT_REPO\\\\Deaf_Transcription_Service\\\\.venv\\\\Scripts\\\\python.exe\" -c \"import diart; print\\(diart.__version__\\)\" 2>&1)",
+      "PowerShell(& \"d:\\\\GIT_REPO\\\\Deaf_Transcription_Service\\\\.venv\\\\Scripts\\\\python.exe\" -m pip install imageio-ffmpeg miniaudio --quiet 2>&1)",
+      "PowerShell(& \"d:\\\\GIT_REPO\\\\Deaf_Transcription_Service\\\\.venv\\\\Scripts\\\\python.exe\" \"d:\\\\GIT_REPO\\\\Deaf_Transcription_Service\\\\bridge\\\\whisper_launcher.py\" --help 2>&1)"
+    ]
+  }
+}
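
Note on the launcher these permissions refer to: the CLAUDE.md changes in this commit describe `bridge/whisper_launcher.py` as applying an ffmpeg PATH fix and a torchaudio shim before it calls `whisperlivekit.cli.main()`. The launcher source is not part of this diff, so the following is only a minimal sketch of those two patches. The helper names `prepend_to_path`, `install_set_audio_backend_shim`, and `launch` are hypothetical, and the real launcher may need more than a PATH change if the bundled ffmpeg binary carries a versioned file name.

```python
import os
import types


def prepend_to_path(directory: str) -> None:
    """Put `directory` at the front of PATH so spawned processes search it first."""
    parts = [p for p in os.environ.get("PATH", "").split(os.pathsep) if p]
    if directory in parts:
        parts.remove(directory)
    os.environ["PATH"] = os.pathsep.join([directory] + parts)


def install_set_audio_backend_shim(torchaudio_module: types.ModuleType) -> bool:
    """Add a no-op `set_audio_backend` if the module lacks one.

    Returns True when the shim was installed, False when the attribute
    already existed and was left untouched.
    """
    if hasattr(torchaudio_module, "set_audio_backend"):
        return False
    torchaudio_module.set_audio_backend = lambda backend: None
    return True


def launch() -> None:
    """Assumed wiring; needs imageio-ffmpeg, torchaudio and whisperlivekit installed."""
    import imageio_ffmpeg
    import torchaudio

    # imageio_ffmpeg.get_ffmpeg_exe() returns the path of the bundled binary.
    prepend_to_path(os.path.dirname(imageio_ffmpeg.get_ffmpeg_exe()))
    # The shim must be in place before diart is imported.
    install_set_audio_backend_shim(torchaudio)

    # Entry point as named in CLAUDE.md; not verified against the package here.
    from whisperlivekit.cli import main
    main()
```

Both patches are applied before WhisperLiveKit is imported because, per CLAUDE.md, diart calls `set_audio_backend` at module load time.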

+ 2 - 1
.gitignore

@@ -1,3 +1,4 @@
 /.venv
 Installers
-codes.md
+codes.md
+Audio Samples

+ 169 - 156
CLAUDE.md

@@ -6,251 +6,264 @@ This file provides context for AI-assisted development sessions on the Church Li
 
 ## Project Summary
 
-A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and sends speaker-tagged rolling text over MQTT to an ESP32 driving a large e-ink display. No cloud services. No internet required during operation.
+A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and serves a fullscreen display page over the local WiFi network. Any tablet, TV, or browser-capable device on the same network can act as the display. No cloud services. No internet required during operation.
 
 ---
 
 ## Architecture
 
-```
+```text
 [Audio source]
      ↓ (USB mic or mixer line-in)
 [Windows PC]
-  ├── WhisperLiveKit
+  ├── WhisperLiveKit (port 8000)
   │     ├── Whisper large-v3 (transcription)
-  │     └── Streaming Sortformer (real-time speaker diarization)
+  │     └── diart / pyannote (real-time speaker diarization)
   │     WebSocket output: ws://localhost:8000/asr
-  ├── Mosquitto MQTT broker (port 1883)
+  ├── Mosquitto MQTT broker (port 1883, internal bus)
   ├── bridge.py
   │     ├── Subscribes to Whisper WebSocket
   │     ├── Receives: {text, speaker_id, is_final, ...}
-  │     ├── Resolves speaker_id → name via speaker_registry
+  │     ├── Resolves speaker_id → name via speakers.json
   │     ├── Buffers text to sentence boundary
   │     └── Publishes JSON payload to MQTT topic display/text
-  └── admin_ui.py (Tkinter)
-        ├── Shows "New speaker detected" prompts
-        ├── Operator types name once per unknown speaker
-        └── Updates speaker_registry in real time
-
-     ↓ WiFi / MQTT
-[ESP32-S3]
-  └── Waveshare 7.5" V2 e-ink display (SPI, GxEPD2 library)
+  └── admin.py (port 8001)
+        ├── Speaker name management (REST API + web UI)
+        ├── Per-speaker voice sample library
+        ├── Test recording playback
+        ├── Subscribes to MQTT display/text
+        └── /display — fullscreen display page (SSE push to browsers)
+
+     ↓ WiFi (browser on local network)
+[Any tablet / TV / browser-capable device]
+  └── http://[PC-IP]:8001/display   ← fullscreen display page
 ```
 
+Multiple display devices can connect simultaneously.
+
 ---
 
 ## PC Environment
 
 - OS: Windows 10/11
-- GPU: NVIDIA RTX series (RTX 4070 Super available)
-- Python: 3.11+
+- GPU: NVIDIA RTX series (RTX 4070 Super tested)
+- Python: 3.12 (required — PyTorch wheels not published for 3.13+ yet)
 - MQTT broker: Mosquitto (localhost:1883)
-- Whisper server: WhisperLiveKit with `--diarization` flag
-  - Command: `whisperlivekit-server --model large-v3 --diarization --language en`
+- Whisper server: WhisperLiveKit launched via `bridge/whisper_launcher.py`
+  - Command: `python bridge\whisper_launcher.py --model large-v3 --lan en --diarization-backend diart`
   - WebSocket: `ws://localhost:8000/asr`
-- Diarization model: Streaming Sortformer (SOTA 2025, via WhisperLiveKit)
-  - Fallback: Diart (more stable, slightly older, also integrated in WhisperLiveKit)
-  - Requires pyannote model access (HuggingFace token + model agreement)
-
-### WhisperLiveKit Diarization Setup Notes
-- Install with diarization extra: `pip install whisperlivekit[diarization-sortformer]`
-- Sortformer and Voxtral extras are incompatible — install in separate environments
-- Must accept HuggingFace user conditions for:
-  - `pyannote/segmentation`
-  - `pyannote/segmentation-3.0`
-  - `pyannote/embedding`
-- Login: `huggingface-cli login`
-- Streaming Sortformer is marked as in active development — fallback to Diart if unstable
+- Diarization: diart (pyannote.audio streaming), activated via `--diarization-backend diart`
+  - Requires HuggingFace token and accepted licence for `pyannote/speaker-diarization-3.1` and `pyannote/segmentation-3.0`
+  - Sortformer backend exists but requires NVIDIA NeMo — not installed; diart is the active backend
+
+### WhisperLiveKit Launch Notes
+
+`bridge/whisper_launcher.py` must be used instead of `wlk` directly. It applies two patches before loading WhisperLiveKit:
+
+1. **ffmpeg PATH** — adds `imageio-ffmpeg` bundled binary to PATH so `whisperlivekit.ffmpeg_manager` can spawn ffmpeg without a system-wide install
+2. **torchaudio shim** — injects `torchaudio.set_audio_backend = lambda b: None` before diart is imported; diart calls this function at module load time but it was removed in torchaudio 2.x
+
+### CUDA Notes
+
+- PyTorch must be installed from the CUDA index (`--index-url https://download.pytorch.org/whl/cu124`)
+- CUDA Toolkit 12.x must be separately installed from NVIDIA (provides `cublas64_12.dll`)
+- Without CUDA, WhisperLiveKit falls back to CPU; large-v3 on CPU is ~15× slower than real-time — not viable for live services
+- GPU target: RTX 4070 Super runs large-v3 comfortably in real-time
 
 ---
 
-## ESP32 Environment
+## Display Environment
 
-- Board: ESP32-S3 (PSRAM required for large font glyph buffers)
-- Framework: Arduino via PlatformIO
-- Display: Waveshare 7.5" V2 (800×480 pixels, black/white)
-- Display library: GxEPD2
-- MQTT library: PubSubClient (increase buffer: `client.setBufferSize(512)`)
-- Build tool: PlatformIO (VSCode)
+The display is a fullscreen browser page served by `admin.py` at `/display`.
 
-### SPI Wiring (Waveshare 7.5" V2 → ESP32)
+- **URL**: `http://[PC-IP]:8001/display` — open on any tablet, TV, or spare device on the same WiFi
+- **Push mechanism**: Server-Sent Events (SSE). `admin.py` subscribes to MQTT and forwards `display/text` payloads to connected browsers.
+- **Layout**: Speaker name header at top, 3 rolling lines of transcription text below; font scales to screen size
+- **Full-screen**: press F11 in browser (or use guided kiosk mode on tablet)
+- **Multiple simultaneous displays**: each browser is an independent SSE subscriber
 
-| Display Pin | ESP32 Pin |
-|---|---|
-| BUSY | GPIO 4 |
-| RST | GPIO 16 |
-| DC | GPIO 17 |
-| CS | GPIO 5 |
-| CLK | GPIO 18 |
-| DIN | GPIO 23 |
-| GND | GND |
-| VCC | 3.3V |
+No microcontroller, firmware, or hardware assembly is required.
 
 ---
 
 ## MQTT Topics
 
 | Topic | Direction | Payload |
-|---|---|---|
-| `display/text` | PC → ESP32 | JSON: see payload schema below |
-| `display/clear` | PC → ESP32 | Empty / any value |
-| `display/status` | ESP32 → PC | JSON: `{"ready": true}` |
+| --- | --- | --- |
+| `display/text` | bridge → admin.py → display browsers | JSON: see schema below |
+| `display/clear` | bridge → admin.py → display browsers | Empty |
 
 ### display/text Payload Schema
 
 ```json
 {
-  "speaker": "PASTOR JOHN",
-  "speaker_changed": true,
   "lines": [
+    "PASTOR JOHN",
     "...and He said unto them, go",
     "into all the world and preach"
   ]
 }
 ```
 
-- `speaker`: resolved name string, or `null` if unknown/unnamed
-- `speaker_changed`: `true` triggers full display refresh + speaker header redraw
-- `lines`: array of pre-wrapped strings, max 40 chars each, max 3 items
+- `lines`: array of strings, max `DISPLAY_LINES` items (currently 3); speaker name injected as first line on speaker change
+- Bridge pre-wraps text at `MAX_LINE_CHARS` (38) using `textwrap.wrap`
 
 ---
 
 ## Key Files
 
 ### `bridge/bridge.py`
-Main orchestrator. Connects to Whisper WebSocket and Mosquitto. Receives incremental diarized transcription. Buffers text. Resolves speaker names. Publishes MQTT payloads.
 
-**WebSocket message fields from WhisperLiveKit (with diarization):**
-```json
-{
-  "text": "and He said unto them",
-  "speaker": "SPEAKER_0",
-  "is_final": true,
-  "start": 12.4,
-  "end": 15.1
-}
-```
+Main audio pipeline. Headless — no UI. Connects to Whisper WebSocket and Mosquitto.
 
-**Bridge logic:**
-1. On each `is_final` segment, extract `text` and `speaker`
-2. Resolve `speaker` → name via `speaker_registry`
-3. If speaker is unknown, notify `admin_ui` (via queue or callback)
-4. Accumulate text into rolling buffer
-5. On sentence boundary or 4s timeout, word-wrap and publish to MQTT
-6. Set `speaker_changed: true` if speaker differs from last published segment
-
-### `bridge/speaker_registry.py`
-Manages the session-persistent mapping of `SPEAKER_N` IDs to real names.
-
-```python
-# Core interface
-registry = SpeakerRegistry()
-registry.assign(speaker_id="SPEAKER_0", name="Pastor John")
-name = registry.resolve("SPEAKER_0")  # Returns "Pastor John" or None
-registry.is_known("SPEAKER_1")        # Returns False
-registry.save_session()               # Persist to JSON for the session
-```
+**Current state:**
 
-- Session data stored in `bridge/sessions/YYYY-MM-DD.json`
-- v2: will also store voice embeddings per speaker for cross-session recognition
+- `BridgeState` class holds all mutable state (thread-safe via `threading.Lock`)
+- `speaker_names`: dict loaded from `speakers.json`, polled for changes every 5s via `_speaker_reloader()`
+- `push_final()`: accumulates text, detects speaker change, flushes on sentence boundary or timeout
+- `_flush()`: word-wraps with `textwrap.wrap(text, 38)`, maintains 3-line rolling display, injects `[SPEAKER NAME]` label on speaker change, publishes to MQTT
+- `_choose_audio_device()`: lists input devices, respects `AUDIO_DEVICE` config constant
+- Audio path: `sounddevice.InputStream` → asyncio queue → WebSocket chunks to WhisperLiveKit
 
-### `bridge/admin_ui.py`
-Lightweight Tkinter window. Runs in a separate thread alongside bridge.py.
+**Config constants** (top of file):
 
-**Behaviour:**
-- Displays current speaker label and resolved name (or "Unknown")
-- When a new unknown `SPEAKER_N` is detected, shows a prompt: "New speaker detected. Who is this?"
-- Operator types name and hits Enter
-- Calls `registry.assign()` and the display updates immediately
-- Also shows a manual override: operator can retype any name at any time
+- `MQTT_HOST`, `WS_URL`, `SAMPLE_RATE=16000`, `BLOCKSIZE=4096`
+- `SENTENCE_TIMEOUT=4.0`, `MAX_LINE_CHARS=38`, `DISPLAY_LINES=3`
+- `AUDIO_DEVICE=None` — set to an integer index to force a specific microphone
 
-### `esp32/src/main.cpp`
-ESP32 firmware. WiFi + MQTT client. Receives JSON payloads and renders to e-ink.
+### `bridge/admin.py`
 
-**Display rendering logic:**
-- On `speaker_changed: true`: full refresh, print speaker name in large CAPS on line 1, then print text lines below
-- On `speaker_changed: false`: partial refresh, overwrite text lines only (speaker header stays)
-- Track partial refresh count; force full refresh every 10 cycles to clear ghosting
-- Font: large enough for ~40 chars across 800px (approx FreeSans 18–24pt at this resolution)
+FastAPI web server on port 8001. Single-file — HTML/CSS/JS embedded as a Python string.
 
----
+**Endpoints:**
+
+- `GET /` — speaker management web UI
+- `GET|POST /api/speakers` — list / add speakers
+- `PUT|DELETE /api/speakers/{sid}` — rename / remove speaker
+- `POST|GET /api/speakers/{sid}/recording` — upload / serve per-speaker voice sample
+- `POST /api/test/upload` — upload full-service test recording
+- `GET /api/test/files` — list test recordings
+- `DELETE /api/test/files/{filename}` — delete test recording
+- `POST /api/test/start` — stream test recording to WhisperLiveKit (via `_stream_file()`)
+- `POST /api/test/stop` — cancel active playback
+- `GET /api/test/status` — playback progress / state
+- `GET /display` — fullscreen display page *(not yet implemented)*
+- `GET /api/display/stream` — SSE endpoint for display page *(not yet implemented)*
+
+**Test playback**: `_stream_file()` is an asyncio task that reads audio via `miniaudio.stream_file()` (handles WAV/MP3/FLAC/OGG/M4A, resamples to 16kHz mono) and streams chunks to `ws://localhost:8000/asr`, mimicking live microphone input.
+
+### `bridge/whisper_launcher.py`
+
+Startup wrapper for WhisperLiveKit. Applies ffmpeg PATH fix and torchaudio shim, then calls `whisperlivekit.cli.main()`. Used by `start.bat` instead of `wlk` directly.
 
-## Display Layout (800×480 pixels)
+### `bridge/speakers.json`
 
+Auto-created on first run. Format: `{"SPEAKER_00": "Pastor", "SPEAKER_01": "Reader", ...}`. Seeded with 4 defaults. Persists across sessions. Written by both `bridge.py` and `admin.py`; `bridge.py` polls mtime every 5s to pick up admin changes.
+
+### `bridge/requirements.txt`
+
+```text
+paho-mqtt>=2.0
+websockets>=12.0
+sounddevice>=0.4.6
+numpy>=1.24
+fastapi>=0.111
+uvicorn>=0.29
+python-multipart>=0.0.9
+miniaudio>=1.59
+imageio-ffmpeg>=2.9
 ```
-┌────────────────────────────────────────────────┐  ← full width
-│ PASTOR JOHN                                    │  ← speaker name, top ~80px, bold/large
-│────────────────────────────────────────────────│
-│ ...and He said unto them, go into all the      │  ← text line 1
-│ world and preach the gospel to every           │  ← text line 2
-│ creature. He that believeth and is baptised    │  ← text line 3
-└────────────────────────────────────────────────┘
+
+---
+
+## Display Layout
+
+Browser-based, scales to any screen size.
+
+```text
+┌─────────────────────────────────────────────────┐
+│ PASTOR JOHN                                     │  ← speaker name, prominent, top section
+│─────────────────────────────────────────────────│
+│ ...and He said unto them, go into all the       │  ← line 1
+│ world and preach the gospel to every            │  ← line 2
+│ creature. He that believeth and is baptised     │  ← line 3
+└─────────────────────────────────────────────────┘
 ```
 
-- Speaker name zone: top ~80px
-- Text zone: remaining ~380px, 3 lines at ~120px each
-- On speaker change: full clear, redraw both zones
-- On same speaker new text: partial refresh text zone only
+- Font size scales with viewport width — target readability at 3–5 metres
+- High contrast: white text on dark background recommended for bright church environments
+- Speaker name shown only when speaker changes (not repeated per line)
+- Instant update via SSE — no refresh flash
 
 ---
 
 ## Speaker Diarization Notes
 
-### v1 — Operator-Assisted Naming
-- Zero prep before service
-- admin_ui.py shows prompt when new `SPEAKER_N` appears
-- Operator at sound desk types name (e.g. "Pastor John") once
-- Registry holds the mapping for the entire session
+### Active approach — diart (pyannote.audio)
 
-### v2 — Voice Enrolment (future)
-- Record 10–30s of each speaker saying natural speech (not word lists)
-- Extract speaker embedding using pyannote `SpeakerEmbedding` pipeline
-- Store embedding in `bridge/profiles/<name>.npy`
-- At runtime, compare incoming `SPEAKER_N` embedding to stored profiles
-- If cosine similarity > threshold (~0.85), auto-assign name
-- Fall back to operator prompt if no match above threshold
+- Launched via `--diarization-backend diart` in `whisper_launcher.py`
+- torchaudio compatibility shim applied in launcher (set_audio_backend removed in 2.x)
+- Tracks 2–4+ speakers reliably in clean audio conditions
+- Works best with direct mixer feed; background music may confuse diarization
+- Congregation responses ("Amen", "Hallelujah") appear as brief unknown speakers — minimum-duration filter (~2s) before triggering admin alert is a future improvement
 
-### Known Diarization Constraints
-- Streaming Sortformer tracks 2–4+ speakers reliably
-- Works best with clean, low-noise audio — direct mixer feed strongly preferred
-- Background music (worship) may confuse diarization; consider muting music channel on the transcription input
-- Congregation responses ("Amen", "Hallelujah") may appear as brief unknown speakers — consider a minimum-duration filter (~2s) before triggering a speaker prompt
+### v1 — Operator-Assisted Naming (current)
+
+- New `SPEAKER_XX` IDs appear automatically in the admin web table within 5s (via speakers.json polling)
+- Operator types the name in the table row and saves — takes effect immediately
+- Speaker names persist across sessions in speakers.json
+
+### v2 — Voice Enrolment (planned)
+
+- Upload a 10–30s clear speech sample per speaker via the admin page's Voice Sample column (upload already implemented)
+- Extract embedding using pyannote `SpeakerEmbedding` pipeline
+- At runtime, compare incoming `SPEAKER_N` embedding to stored profiles
+- Auto-assign name if cosine similarity > threshold (~0.85); fall back to operator prompt otherwise
+- Embeddings stored in `bridge/profiles/<name>.npy`
 
 ---
 
 ## Design Constraints & Open Questions
 
-- [ ] Streaming Sortformer stability in WhisperLiveKit — test early; fall back to Diart if needed
-- [ ] Minimum speaker segment duration before triggering name prompt (avoid congregation one-liners)
-- [ ] Partial refresh ghosting — determine optimal full-refresh interval for the chosen display
-- [ ] ESP32-S3 PSRAM: confirm font glyph buffer fits; WROOM (no PSRAM) likely insufficient for large fonts
-- [ ] Word-wrap edge cases: long proper nouns, scripture references, place names
-- [ ] Session save/restore: if PC crashes mid-service, can operator reload speaker assignments quickly?
+- [ ] Display page `/display` not yet built — next major task
+- [ ] SSE push from admin.py to display browsers — requires admin.py to subscribe to MQTT or receive updates from bridge.py via shared state
+- [ ] Minimum speaker segment duration before adding to admin table (avoid congregation one-liners populating 50 rows)
+- [ ] Voice enrolment v2 — pyannote.audio is installed, extraction pipeline not yet written
+- [ ] Word-wrap edge cases: long proper nouns, scripture references
+- [ ] Session save/restore: if PC crashes mid-service, speakers.json persists so names reload immediately on restart
 - [ ] Audio routing on Windows: ensure Whisper receives the mixer/mic channel, not system audio
+- [ ] CUDA Toolkit 12.x installation required for GPU acceleration (cublas64_12.dll)
 
 ---
 
 ## Testing Approach
 
-1. **Whisper standalone**: speak into mic, verify text output in browser at `http://localhost:8000`
-2. **Diarization standalone**: two people alternate speaking, verify `SPEAKER_0` / `SPEAKER_1` labels in WS output
-3. **Registry + bridge**: run bridge.py, verify name prompts appear in admin_ui.py, verify MQTT payloads via `mosquitto_sub -t display/#`
-4. **ESP32 display**: send static MQTT messages manually before connecting bridge
-5. **End-to-end**: full pipeline test with recorded sermon audio (mix of 2–3 speakers)
-6. **In-situ trial**: 1–2 Sunday services with a volunteer congregant providing feedback
+1. **Whisper standalone**: speak into mic, verify text in browser at `http://localhost:8000`
+2. **Diarization**: two people alternate speaking, verify `SPEAKER_00` / `SPEAKER_01` labels in WS output
+3. **Bridge**: run `bridge.py`, verify MQTT payloads via `mosquitto_sub -t display/#`
+4. **Admin**: open `http://localhost:8001`, verify speaker rows appear, rename one, confirm bridge picks up the change within 5s
+5. **Test playback**: upload a full service recording via admin, press Play at 4×, verify transcription appears in MQTT and display
+6. **Display page**: open `http://[PC-IP]:8001/display` on tablet, verify text updates in real time
+7. **In-situ trial**: 1–2 Sunday services with a volunteer congregant providing feedback
+
+---
+
+## Development Sequence (Remaining)
+
+1. Build `/display` fullscreen browser page in `admin.py`
+2. Add SSE endpoint (`/api/display/stream`) in `admin.py` — subscribe to MQTT, push payloads to browsers
+3. Style display page: large font, dark background, speaker header, 3-line rolling text
+4. Install CUDA Toolkit 12.x on the production PC to enable GPU acceleration
+5. Voice enrolment v2 — extract pyannote embeddings from uploaded samples, add matching logic to bridge
+6. Church deployment trial
 
 ---
 
-## Development Sequence (Suggested)
+## Further Enhancements
 
-1. Get WhisperLiveKit running with `--diarization` flag, confirm WS output includes speaker labels
-2. Write `bridge.py` (transcription only, no diarization yet) → verify MQTT publish works
-3. Add `speaker_registry.py` and `admin_ui.py` → test name mapping loop
-4. Integrate diarization into bridge — handle `speaker_changed` logic
-5. Write ESP32 firmware — basic text display
-6. Add speaker header zone and refresh logic to ESP32 firmware
-7. Full end-to-end test on bench
-8. Church trial
+- Convert speakers.json to remote database for multi-event / multi-location usage
+- Transcription log: table of speaker name + first sentence of each turn, exportable after the service
+- Minimum-duration filter: suppress `SPEAKER_XX` rows for segments under ~2s (congregation responses)
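
Note on the `display/text` payload documented in CLAUDE.md above: `_flush()` is described as wrapping text with `textwrap.wrap(text, 38)`, keeping a 3-line rolling display, and injecting the speaker name as a line on speaker change. The `bridge.py` source is not shown in this diff, so the sketch below is one possible reading of that description. The class name `RollingDisplay` is hypothetical, and letting the speaker line scroll off with the rest of the text (rather than pinning it) is an assumption.

```python
import json
import textwrap

MAX_LINE_CHARS = 38   # value quoted in CLAUDE.md
DISPLAY_LINES = 3     # value quoted in CLAUDE.md


class RollingDisplay:
    """Hypothetical stand-in for the rolling-line part of BridgeState."""

    def __init__(self):
        self.lines = []
        self.last_speaker = None

    def flush(self, text, speaker):
        """Wrap `text`, add a speaker line on change, return the payload dict."""
        if speaker is not None and speaker != self.last_speaker:
            # Assumption: the name is an ordinary line and can scroll away.
            self.lines.append(speaker.upper())
            self.last_speaker = speaker
        self.lines.extend(textwrap.wrap(text, MAX_LINE_CHARS))
        # Keep only the most recent DISPLAY_LINES entries.
        self.lines = self.lines[-DISPLAY_LINES:]
        return {"lines": list(self.lines)}


display = RollingDisplay()
first = display.flush("go into all the world", "Pastor John")
second = display.flush("and preach the gospel", "Pastor John")
third = display.flush("to every creature.", "Pastor John")
fourth = display.flush("A reading from Luke", "Mary")

# This JSON string is what would be published to the display/text topic.
message = json.dumps(fourth)
```

With this reading, a long utterance pushes the speaker name off the screen after two more lines; a display that pins the name would need the header carried separately from `lines`.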

+ 89 - 91
README.md

@@ -1,23 +1,26 @@
 # Church Live Transcription Display
 
-A live speech-to-text system for deaf and hard-of-hearing congregants, displaying real-time transcriptions with speaker identification on an e-ink screen driven by an ESP32 microcontroller.
+A live speech-to-text system for deaf and hard-of-hearing congregants, displaying real-time transcriptions with speaker identification on any tablet, screen, or browser-capable device.
 
 ## Overview
 
-Audio from the church service is captured on a Windows PC, transcribed and speaker-diarized locally using WhisperLiveKit, and the resulting speaker-tagged text is pushed over WiFi/MQTT to an ESP32 that drives a large e-ink display. The display shows who is speaking alongside what they are saying, updates in real time, and requires no internet connection.
+Audio from the church service is captured on a Windows PC, transcribed and speaker-diarized locally using WhisperLiveKit, and the resulting speaker-tagged text is served to a fullscreen browser page. Any tablet, TV, or device with a browser on the same WiFi can act as the display — no custom hardware required.
 
 ```
 [Microphone / Mixer]
 [Windows PC]
   ├── WhisperLiveKit  (transcription + speaker diarization)
-  ├── Mosquitto MQTT broker
+  ├── Mosquitto MQTT broker (internal message bus)
   ├── bridge.py       (WebSocket → name mapping → MQTT)
-  └── Speaker Admin UI (operator names speakers live)
-        ↓ MQTT / WiFi
-[ESP32 + e-ink display]
+  └── admin.py        (speaker management + fullscreen display page)
+        ↓ WiFi (browser)
+[Tablet / TV / Any device with a browser]
+  └── http://[PC-IP]:8001/display   ← fullscreen display page
 ```
 
+Multiple display devices can be open at once, for example one at the front of the church and one at the hearing-loop desk.
+
 ## Goals
 
 - Real-time captions with minimal latency (target: < 3 seconds end-to-end)
@@ -25,8 +28,8 @@ Audio from the church service is captured on a Windows PC, transcribed and speak
 - Named speakers — operator maps anonymous speaker IDs to real names during service
 - Future: voice enrolment so names are matched automatically from pre-recorded samples
 - Runs entirely on local network — no cloud dependency
-- Readable at distance with large font (36–48pt equivalent)
-- Low cost, low complexity hardware
+- Readable at distance — large, auto-scaling font
+- No custom hardware — any spare tablet or screen works as the display
 
 ---
 
@@ -34,106 +37,104 @@ Audio from the church service is captured on a Windows PC, transcribed and speak
 
 ### How It Works
 
-WhisperLiveKit includes **Streaming Sortformer** (SOTA 2025), a real-time speaker diarization model developed by NVIDIA. It runs alongside Whisper transcription and tags each segment of speech with an anonymous speaker label (`SPEAKER_0`, `SPEAKER_1`, etc.).
+WhisperLiveKit includes built-in real-time speaker diarization via **diart** (pyannote.audio). It runs alongside Whisper transcription and tags each segment of speech with an anonymous speaker label (`SPEAKER_00`, `SPEAKER_01`, etc.).
 
-A name mapping layer in the bridge script translates these anonymous labels into real names, which are then included in the MQTT payload sent to the display.
+A name mapping layer in the bridge script translates these labels into real names, which are then pushed to the display page.
 
 ### Display Format
 
-When a speaker changes, their name is shown as a header line above their words. The name is not repeated on every line — only when the speaker changes.
+When a speaker changes, their name is shown as a header line above their words.
 
 ```
-┌─────────────────────────────────┐
-│ PASTOR JOHN                     │
-│ ...and He said unto them, go    │
-│ into all the world and preach   │
-└─────────────────────────────────┘
-
-┌─────────────────────────────────┐
-│ MARY (READER)                   │
-│ A reading from Luke chapter 4,  │
-│ verse 18...                     │
-└─────────────────────────────────┘
+┌─────────────────────────────────────────┐
+│ PASTOR JOHN                             │
+│                                         │
+│ ...and He said unto them, go into       │
+│ all the world and preach the gospel     │
+│ to every creature.                      │
+└─────────────────────────────────────────┘
+
+  (speaker changes)
+
+┌─────────────────────────────────────────┐
+│ MARY (READER)                           │
+│                                         │
+│ A reading from Luke chapter 4,          │
+│ verse 18...                             │
+└─────────────────────────────────────────┘
 ```
 
 ### Speaker Naming — Two Approaches
 
-**v1 — Operator-Assisted Naming (implemented first)**
+#### v1 — Operator-Assisted Naming (implemented)
 
-A simple admin UI runs on the PC alongside the bridge script. When a new unknown speaker is detected, the operator sees a prompt ("New speaker detected — who is this?") and types the name once. That name is stored for the session and used every time that speaker is detected again.
+The speaker admin page at `http://[PC-IP]:8001` shows all detected speakers. When a new `SPEAKER_XX` appears, the operator types the name once. That name is stored persistently and used for every future session.
 
 - No setup required before the service
 - Works from the very first Sunday
 - Operator (e.g. sound desk volunteer) assigns names as speakers appear
 
-**v2 — Voice Enrolment (future upgrade)**
+#### v2 — Voice Enrolment (planned)
 
-Before the service, a short voice sample (10–30 seconds) is recorded for each expected speaker. The bridge script compares incoming speaker embeddings against enrolled voices and automatically assigns the correct name without operator input.
+Before the service, a short voice sample (10–30 seconds) is uploaded for each expected speaker via the admin page. The bridge compares incoming speaker embeddings against enrolled voices and automatically assigns the correct name without operator input.
 
 - No operator intervention during the service
 - More accurate for recurring speakers (pastor, regular readers)
 - Enrolled voice profiles persist week to week
 
-### Typical Church Speakers
-
-| Role | Frequency | Notes |
-|---|---|---|
-| Pastor / Preacher | Every service | Primary speaker, longest segments |
-| Worship leader | Most services | May overlap with congregation response |
-| Reader / Scripture | Weekly | Short, distinct segments |
-| Visiting speaker | Occasionally | New enrolment or operator naming needed |
-| Announcements | Weekly | Often the same person each week |
-
 ---
 
 ## System Components
 
 ### PC Side (Windows)
-- **WhisperLiveKit** — local GPU-accelerated transcription + diarization server
-- **Mosquitto** — lightweight MQTT broker (same PC, port 1883)
+
+- **WhisperLiveKit** — local GPU-accelerated transcription + diarization server (port 8000)
+- **Mosquitto** — lightweight MQTT broker (internal message bus, port 1883)
 - **bridge.py** — WebSocket subscriber, speaker name mapper, MQTT publisher
-- **admin_ui.py** — lightweight operator interface for live speaker naming
-- **speaker_registry.py** — manages speaker ID ↔ name mappings and voice enrolment
+- **admin.py** — web server (port 8001) providing:
+  - Speaker name management (table sized for 40–50 speakers)
+  - Voice sample upload and playback
+  - Test recording playback for offline pipeline testing
+  - `/display` — fullscreen display page for tablets and screens
+
+### Display Side
 
-### ESP32 Side
-- **ESP32-S3** — WiFi microcontroller (S3 preferred — PSRAM needed for large font bitmaps)
-- **Waveshare e-ink display** — 7.5" V2 (800×480) or larger
-- **GxEPD2** — display driver library
-- **PubSubClient** — MQTT client library
+- **Any device with a browser** — tablet, phone, Smart TV, laptop, HDMI monitor + PC
+- Navigate to `http://[PC-IP]:8001/display` and go fullscreen (`F11`)
+- Font and layout scale automatically to the screen size
+- Multiple display devices can be open simultaneously
 
 ---
 
 ## Hardware
 
-| Component | Model | Notes |
-|---|---|---|
-| Microcontroller | ESP32-S3 | PSRAM required for large font bitmaps |
-| Display | Waveshare 7.5" V2 e-Paper | 800×480, supports partial refresh |
-| PC | Windows 10/11 with NVIDIA GPU | RTX series recommended |
-| Microphone | USB condenser or direct mixer feed | Mixer feed preferred for clean diarization |
+| Component | Notes |
+|---|---|
+| Windows PC with NVIDIA GPU | RTX series recommended; RTX 4070 Super tested |
+| Microphone or mixer line-in | USB condenser or direct mixer feed; mixer preferred for clean diarization |
+| Display device | Any tablet, TV, or spare PC with a browser on the same WiFi |
+
+No microcontroller, no custom firmware, no hardware assembly required.
 
 ---
 
 ## Key Design Decisions
 
 ### Text & Speaker Buffering
-The bridge script accumulates text until a sentence boundary or natural pause (~4s), then checks whether the speaker has changed. If the speaker is unchanged, only new text lines are pushed. If the speaker has changed, a full new payload is sent including the speaker name header, triggering a full display refresh.
+The bridge script accumulates text until a sentence boundary or natural pause (~4s), then checks whether the speaker has changed. On speaker change, a full new payload is pushed to the display including the speaker name header. Otherwise only the new text lines are pushed.
 
 ### Display Layout
-- **Line 1:** Speaker name in CAPS — printed only on speaker change
-- **Lines 2–4:** Rolling transcription text, wrapping at ~40 chars per line
-- On speaker change: full screen clear then redraw with new name header
-- Font targets readability at 3–5 metres
 
-### E-ink Refresh Strategy
-- Speaker change → **full refresh** (~1.5s flash — clean slate, acceptable at speaker transition)
-- Same speaker, new text → **partial refresh** (~300ms, minor ghosting)
-- Force full refresh every 10 partial refreshes to clear accumulated ghosting
+- **Speaker name** — shown in CAPS at the top, updated only when the speaker changes
+- **Rolling text** — 3 lines of word-wrapped transcription text below
+- Font scales to the display device's screen size
+- No refresh artifacts — instant update (unlike e-ink)
 
 ### Network
 - All traffic on local WiFi (church LAN or dedicated hotspot)
-- MQTT broker on Windows PC (port 1883)
-- Static IP recommended for ESP32 to avoid reconnection delays
+- MQTT broker on Windows PC (port 1883, internal use)
+- The display page connects to the PC's admin server (port 8001) via the local network
+- Static IP recommended for the PC so the display URL saved on tablets keeps working
 
 ---
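The buffering and layout rules in this hunk (flush on a sentence boundary or a ~4 s pause, send the name header only on speaker change, keep 3 rolling lines) can be sketched roughly as follows. This is a minimal illustration and not the code in `bridge.py`: the names `DisplayState`, `should_flush`, and `build_payload`, the payload dictionary shape, and the 40-character wrap width are all assumptions made for the example.

```python
import textwrap
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DisplayState:
    """What the display is currently showing (hypothetical structure)."""
    speaker: Optional[str] = None
    lines: List[str] = field(default_factory=list)


def should_flush(buffer: str, idle_seconds: float, pause_seconds: float = 4.0) -> bool:
    """Flush when the buffered text ends a sentence or the speaker has paused."""
    text = buffer.strip()
    if not text:
        return False  # nothing to send yet
    return text.endswith((".", "!", "?")) or idle_seconds >= pause_seconds


def build_payload(state: DisplayState, speaker: str, text: str,
                  width: int = 40, max_lines: int = 3) -> dict:
    """Build the message for one flushed chunk and update the rolling lines."""
    new_lines = textwrap.wrap(text, width=width)
    if speaker != state.speaker:
        # Speaker changed: start a clean screen with the name header in CAPS.
        state.speaker = speaker
        state.lines = new_lines[-max_lines:]
        return {"type": "full", "speaker": speaker.upper(), "lines": list(state.lines)}
    # Same speaker: push only the new lines; the display keeps the last max_lines.
    state.lines = (state.lines + new_lines)[-max_lines:]
    return {"type": "append", "lines": new_lines}
```

A caller would check `should_flush` each time new words arrive, then pass the flushed text to `build_payload` and publish the returned dictionary. Keeping the rolling state on the sender side means a display that joins late can be given a `full` payload built from `state`.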
 
@@ -143,45 +144,42 @@ The bridge script accumulates text until a sentence boundary or natural pause (~
 /
 ├── README.md                     — This file
 ├── CLAUDE.md                     — AI assistant context for development sessions
-├── bridge/
-│   ├── bridge.py                 — Main bridge: Whisper WS → name map → MQTT
-│   ├── speaker_registry.py       — Speaker ID ↔ name mapping and voice enrolment
-│   └── admin_ui.py               — Operator UI for live speaker naming (Tkinter)
-├── esp32/
-│   ├── src/
-│   │   └── main.cpp              — ESP32 Arduino firmware
-│   └── platformio.ini            — PlatformIO build config
-└── docs/
-    ├── hardware-wiring.md        — SPI pin connections for Waveshare display
-    ├── setup.md                  — Installation and configuration guide
-    └── speaker-enrolment.md     — Guide for recording and enrolling voice samples (v2)
+├── SETUP.md                      — Installation and configuration guide
+├── install.bat                   — One-time setup script
+├── start.bat                     — Launch script (double-click to start)
+└── bridge/
+    ├── bridge.py                 — Main bridge: Whisper WS → name map → MQTT
+    ├── admin.py                  — Speaker admin + display web server (port 8001)
+    ├── whisper_launcher.py       — WhisperLiveKit startup wrapper (diart patch)
+    ├── requirements.txt          — Python dependencies
+    ├── speakers.json             — Persistent speaker name mappings (auto-created)
+    ├── recordings/               — Per-speaker voice samples (auto-created)
+    └── test_recordings/          — Full-service recordings for pipeline testing
 ```
 
 ---
 
 ## Reference Projects
 
-- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit) — real-time Whisper + Streaming Sortformer diarization
-- [NVIDIA Streaming Sortformer](https://developer.nvidia.com/blog/identify-speakers-in-meetings-calls-and-voice-apps-in-real-time-with-nvidia-streaming-sortformer/) — the diarization model integrated into WhisperLiveKit
-- [pyannote.audio](https://github.com/pyannote/pyannote-audio) — fallback diarization (Diart integration in WhisperLiveKit)
-- [denwilliams/mqtt-epaper](https://github.com/denwilliams/mqtt-epaper) — ESP32 e-paper display driven by MQTT
-- [cuci90/epaper_mqtt_esp32](https://github.com/cuci90/epaper_mqtt_esp32) — ESP32 Waveshare display MQTT template
+- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit) — real-time Whisper + speaker diarization
+- [pyannote.audio / diart](https://github.com/pyannote/pyannote-audio) — streaming speaker diarization
+- [NVIDIA Streaming Sortformer](https://developer.nvidia.com/blog/identify-speakers-in-meetings-calls-and-voice-apps-in-real-time-with-nvidia-streaming-sortformer/) — alternative diarization backend
 
 ---
 
 ## Status
 
-🟡 **Planning / Research phase**
+🟡 **In development — core pipeline functional**
 
 - [x] Architecture defined
-- [x] Speaker diarization approach selected (WhisperLiveKit + Streaming Sortformer)
-- [x] Speaker naming strategy defined (operator-assisted v1, voice enrolment v2)
-- [ ] Python bridge script (transcription only)
-- [ ] Speaker name mapping layer (`speaker_registry.py`)
-- [ ] Operator admin UI (`admin_ui.py`)
-- [ ] ESP32 firmware — basic text display
-- [ ] ESP32 firmware — speaker header layout + refresh logic
-- [ ] Hardware wiring and bench test
-- [ ] End-to-end integration test
-- [ ] Voice enrolment system (v2)
-- [ ] Church deployment trial
+- [x] WhisperLiveKit + diart diarization working
+- [x] bridge.py — transcription → MQTT pipeline
+- [x] admin.py — web speaker management (40–50 speakers)
+- [x] Persistent speaker name storage (speakers.json)
+- [x] Per-speaker voice sample upload
+- [x] Test recording playback for offline pipeline testing
+- [x] install.bat / start.bat — double-click operation
+- [ ] `/display` fullscreen browser display page
+- [ ] SSE or WebSocket push from admin.py to display page
+- [ ] Voice enrolment v2 (auto name matching from voice samples)
+- [ ] Church deployment trial

+ 1 - 0
bridge/requirements.txt

@@ -6,3 +6,4 @@ fastapi>=0.111
 uvicorn>=0.29
 python-multipart>=0.0.9
 miniaudio>=1.59
+imageio-ffmpeg>=2.9

BIN
bridge/test_recordings/260218 Hobart R BDH.mp3


+ 46 - 0
bridge/whisper_launcher.py

@@ -0,0 +1,46 @@
+#!/usr/bin/env python3
+"""
+whisper_launcher.py — WhisperLiveKit startup wrapper
+
+Applies two compatibility fixes before importing WhisperLiveKit:
+
+  1. FFmpeg path — adds imageio-ffmpeg's bundled binary to PATH so
+     WhisperLiveKit's ffmpeg_manager can find ffmpeg without a system
+     install.  If imageio-ffmpeg is absent it falls back to the system PATH.
+
+  2. torchaudio shim — diart/audio.py calls torchaudio.set_audio_backend()
+     at import time, but recent torchaudio 2.x releases removed that function.
+     We inject a no-op before WhisperLiveKit (and therefore diart) is imported.
+"""
+
+import os
+import sys
+
+# ── Fix 1: make ffmpeg available via imageio-ffmpeg bundled binary ────────────
+
+try:
+    import imageio_ffmpeg  # type: ignore
+    _ffmpeg_exe = imageio_ffmpeg.get_ffmpeg_exe()
+    _ffmpeg_dir = os.path.dirname(_ffmpeg_exe)
+    os.environ["PATH"] = _ffmpeg_dir + os.pathsep + os.environ.get("PATH", "")
+    print(f"[Launcher] ffmpeg: {_ffmpeg_exe}")
+except ImportError:
+    print("[Launcher] WARNING: imageio-ffmpeg not installed — system ffmpeg must be in PATH.")
+    print("[Launcher]          Run:  pip install imageio-ffmpeg")
+except Exception as exc:
+    print(f"[Launcher] WARNING: could not locate imageio-ffmpeg binary: {exc}")
+
+# ── Fix 2: diart torchaudio compatibility shim ────────────────────────────────
+
+try:
+    import torchaudio  # type: ignore
+    if not hasattr(torchaudio, "set_audio_backend"):
+        torchaudio.set_audio_backend = lambda backend: None
+        print("[Launcher] Patched torchaudio.set_audio_backend (diart compatibility)")
+except ImportError:
+    pass  # torchaudio not present; WhisperLiveKit will report the error itself
+
+# ── Start WhisperLiveKit ──────────────────────────────────────────────────────
+
+from whisperlivekit.cli import main  # type: ignore
+sys.exit(main())
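The two fixes in the launcher run as module-level side effects, which makes them awkward to check without installing torchaudio or imageio-ffmpeg. One possible way to make the same logic testable is to factor it into small helpers, as in the sketch below. The helper names `prepend_to_path` and `install_noop` are hypothetical and are not part of this commit or of any library.

```python
import os
import types


def prepend_to_path(directory: str, env: dict) -> str:
    """Put `directory` first in env['PATH'] unless it is already first."""
    current = env.get("PATH", "")
    parts = current.split(os.pathsep) if current else []
    if parts and parts[0] == directory:
        return current  # already in front; avoid duplicating the entry
    env["PATH"] = os.pathsep.join([directory] + parts)
    return env["PATH"]


def install_noop(module, name: str) -> bool:
    """Attach a do-nothing callable as `module.name` if it is missing.

    Returns True when the attribute was added, False when it already existed.
    """
    if hasattr(module, name):
        return False
    setattr(module, name, lambda *args, **kwargs: None)
    return True


if __name__ == "__main__":
    # Stand-in objects, so this demo needs neither torchaudio nor ffmpeg.
    fake_env = {"PATH": "C:\\Windows"}
    print(prepend_to_path("C:\\ffmpeg\\bin", fake_env))
    fake_torchaudio = types.SimpleNamespace()
    print(install_noop(fake_torchaudio, "set_audio_backend"))
```

In the launcher these would be called as `prepend_to_path(_ffmpeg_dir, os.environ)` and `install_noop(torchaudio, "set_audio_backend")`. Accepting `*args, **kwargs` in the no-op keeps it tolerant of however the caller passes the backend name.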

+ 11 - 7
start.bat

@@ -60,18 +60,22 @@ echo ============================================================
 echo  Church Live Transcription Display
 echo ============================================================
 echo.
-echo Starting Whisper server in a new window...
+echo Starting Whisper server ^(with speaker diarization^)...
 echo Starting bridge in a new window...
+echo Starting speaker admin in a new window...
 echo.
-echo Both windows must stay open during the service.
-echo Close this window or both others to shut down.
+echo All three windows must stay open during the service.
+echo.
+echo NOTE: First run downloads diarization models ^(~500 MB^).
+echo       Wait for "Server running" before speaking.
 echo.
 
-:: Activate venv and launch WhisperLiveKit in its own window
-start "Whisper Transcription Server" cmd /k "call .venv\Scripts\activate.bat && set HF_TOKEN=%HF_TOKEN% && echo Starting WhisperLiveKit (%WHISPER_MODEL%)... && wlk --model %WHISPER_MODEL% --lan en"
+:: Activate venv and launch WhisperLiveKit via the compatibility launcher
+:: The launcher patches torchaudio for diart and makes ffmpeg available.
+start "Whisper Transcription Server" cmd /k "call .venv\Scripts\activate.bat && set HF_TOKEN=%HF_TOKEN% && echo Starting WhisperLiveKit (%WHISPER_MODEL%) with speaker diarization... && python bridge\whisper_launcher.py --model %WHISPER_MODEL% --lan en --diarization-backend diart"
 
-:: Brief pause so Whisper can begin loading before the bridge connects
-timeout /t 5 /nobreak >nul
+:: Give Whisper more time on first run — diarization model downloads ~500 MB
+timeout /t 15 /nobreak >nul
 
 :: Activate venv and launch the bridge (headless audio pipeline)
 start "Transcription Bridge" cmd /k "call .venv\Scripts\activate.bat && echo Starting bridge... && python bridge\bridge.py"