|
@@ -6,251 +6,264 @@ This file provides context for AI-assisted development sessions on the Church Li
|
|
|
|
|
|
|
|
## Project Summary
|
|
|
|
|
|
|
|
-A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and sends speaker-tagged rolling text over MQTT to an ESP32 driving a large e-ink display. No cloud services. No internet required during operation.
|
|
|
|
|
|
|
+A live captioning system for deaf/hard-of-hearing church congregants. A Windows PC captures audio, transcribes it locally using Whisper (GPU-accelerated), performs real-time speaker diarization, maps anonymous speaker IDs to real names, and serves a fullscreen display page over the local WiFi network. Any tablet, TV, or browser-capable device on the same network can act as the display. No cloud services. No internet required during operation.
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
## Architecture
|
|
|
|
|
|
|
|
-```
|
|
|
|
|
|
|
+```text
|
|
|
[Audio source]
|
|
|
↓ (USB mic or mixer line-in)
|
|
|
[Windows PC]
|
|
|
- ├── WhisperLiveKit
|
|
|
|
|
|
|
+ ├── WhisperLiveKit (port 8000)
|
|
|
│ ├── Whisper large-v3 (transcription)
|
|
|
- │ └── Streaming Sortformer (real-time speaker diarization)
|
|
|
|
|
|
|
+ │ └── diart / pyannote (real-time speaker diarization)
|
|
|
│ WebSocket output: ws://localhost:8000/asr
|
|
|
│
|
|
|
- ├── Mosquitto MQTT broker (port 1883)
|
|
|
|
|
|
|
+ ├── Mosquitto MQTT broker (port 1883, internal bus)
|
|
|
│
|
|
|
├── bridge.py
|
|
|
│ ├── Subscribes to Whisper WebSocket
|
|
|
│ ├── Receives: {text, speaker_id, is_final, ...}
|
|
|
- │ ├── Resolves speaker_id → name via speaker_registry
|
|
|
|
|
|
|
+ │ ├── Resolves speaker_id → name via speakers.json
|
|
|
│ ├── Buffers text to sentence boundary
|
|
|
│ └── Publishes JSON payload to MQTT topic display/text
|
|
|
│
|
|
|
- └── admin_ui.py (Tkinter)
|
|
|
|
|
- ├── Shows "New speaker detected" prompts
|
|
|
|
|
- ├── Operator types name once per unknown speaker
|
|
|
|
|
- └── Updates speaker_registry in real time
|
|
|
|
|
-
|
|
|
|
|
- ↓ WiFi / MQTT
|
|
|
|
|
-[ESP32-S3]
|
|
|
|
|
- └── Waveshare 7.5" V2 e-ink display (SPI, GxEPD2 library)
|
|
|
|
|
|
|
+ └── admin.py (port 8001)
|
|
|
|
|
+ ├── Speaker name management (REST API + web UI)
|
|
|
|
|
+ ├── Per-speaker voice sample library
|
|
|
|
|
+ ├── Test recording playback
|
|
|
|
|
+ ├── Subscribes to MQTT display/text
|
|
|
|
|
+ └── /display — fullscreen display page (SSE push to browsers)
|
|
|
|
|
+
|
|
|
|
|
+ ↓ WiFi (browser on local network)
|
|
|
|
|
+[Any tablet / TV / browser-capable device]
|
|
|
|
|
+ └── http://[PC-IP]:8001/display ← fullscreen display page
|
|
|
```
|
|
```
|
|
|
|
|
|
|
|
|
|
+Multiple display devices can connect simultaneously.
|
|
|
|
|
+
|
|
|
---
|
|
|
|
|
|
|
|
## PC Environment
|
|
|
|
|
|
|
|
- OS: Windows 10/11
|
|
|
-- GPU: NVIDIA RTX series (RTX 4070 Super available)
|
|
|
|
|
-- Python: 3.11+
|
|
|
|
|
|
|
+- GPU: NVIDIA RTX series (RTX 4070 Super tested)
|
|
|
|
|
+- Python: 3.12 (required — PyTorch wheels not published for 3.13+ yet)
|
|
|
- MQTT broker: Mosquitto (localhost:1883)
|
|
|
-- Whisper server: WhisperLiveKit with `--diarization` flag
|
|
|
|
|
- - Command: `whisperlivekit-server --model large-v3 --diarization --language en`
|
|
|
|
|
|
|
+- Whisper server: WhisperLiveKit launched via `bridge/whisper_launcher.py`
|
|
|
|
|
+ - Command: `python bridge\whisper_launcher.py --model large-v3 --lan en --diarization-backend diart`
|
|
|
- WebSocket: `ws://localhost:8000/asr`
|
|
|
-- Diarization model: Streaming Sortformer (SOTA 2025, via WhisperLiveKit)
|
|
|
|
|
- - Fallback: Diart (more stable, slightly older, also integrated in WhisperLiveKit)
|
|
|
|
|
- - Requires pyannote model access (HuggingFace token + model agreement)
|
|
|
|
|
-
|
|
|
|
|
-### WhisperLiveKit Diarization Setup Notes
|
|
|
|
|
-- Install with diarization extra: `pip install whisperlivekit[diarization-sortformer]`
|
|
|
|
|
-- Sortformer and Voxtral extras are incompatible — install in separate environments
|
|
|
|
|
-- Must accept HuggingFace user conditions for:
|
|
|
|
|
- - `pyannote/segmentation`
|
|
|
|
|
- - `pyannote/segmentation-3.0`
|
|
|
|
|
- - `pyannote/embedding`
|
|
|
|
|
-- Login: `huggingface-cli login`
|
|
|
|
|
-- Streaming Sortformer is marked as in active development — fallback to Diart if unstable
|
|
|
|
|
|
|
+- Diarization: diart (pyannote.audio streaming), activated via `--diarization-backend diart`
|
|
|
|
|
+ - Requires HuggingFace token and accepted licence for `pyannote/speaker-diarization-3.1` and `pyannote/segmentation-3.0`
|
|
|
|
|
+ - Sortformer backend exists but requires NVIDIA NeMo — not installed; diart is the active backend
|
|
|
|
|
+
|
|
|
|
|
+### WhisperLiveKit Launch Notes
|
|
|
|
|
+
|
|
|
|
|
+`bridge/whisper_launcher.py` must be used instead of `wlk` directly. It applies two patches before loading WhisperLiveKit:
|
|
|
|
|
+
|
|
|
|
|
+1. **ffmpeg PATH** — adds `imageio-ffmpeg` bundled binary to PATH so `whisperlivekit.ffmpeg_manager` can spawn ffmpeg without a system-wide install
|
|
|
|
|
+2. **torchaudio shim** — injects `torchaudio.set_audio_backend = lambda b: None` before diart is imported; diart calls this function at module load time but it was removed in torchaudio 2.x
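
The two patches above can be sketched as a pair of small helpers. This is a minimal illustration, not the contents of `whisper_launcher.py`: the function names `prepend_to_path` and `apply_set_audio_backend_shim` are hypothetical, and the real launcher would presumably pass the directory of the binary reported by `imageio_ffmpeg.get_ffmpeg_exe()` and the imported `torchaudio` module.

```python
import os
import types


def prepend_to_path(directory: str) -> None:
    """Put `directory` at the front of PATH unless it is already listed."""
    parts = os.environ.get("PATH", "").split(os.pathsep)
    if directory not in parts:
        kept = [p for p in parts if p]  # drop empty entries
        os.environ["PATH"] = os.pathsep.join([directory] + kept)


def apply_set_audio_backend_shim(torchaudio_module) -> bool:
    """Add a no-op set_audio_backend when the module lacks one.

    Returns True if the shim was installed, False if the attribute
    already existed and was left untouched.
    """
    if hasattr(torchaudio_module, "set_audio_backend"):
        return False
    torchaudio_module.set_audio_backend = lambda backend: None
    return True


# Stand-in module so the sketch runs without torchaudio installed.
demo_module = types.ModuleType("torchaudio")
apply_set_audio_backend_shim(demo_module)
demo_module.set_audio_backend("soundfile")  # now a harmless no-op
```

Both helpers must run before diart is imported, because diart calls `set_audio_backend` at module load time.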
|
|
|
|
|
+
|
|
|
|
|
+### CUDA Notes
|
|
|
|
|
+
|
|
|
|
|
+- PyTorch must be installed from the CUDA index (`--index-url https://download.pytorch.org/whl/cu124`)
|
|
|
|
|
+- CUDA Toolkit 12.x must be separately installed from NVIDIA (provides `cublas64_12.dll`)
|
|
|
|
|
+- Without CUDA, WhisperLiveKit falls back to CPU; large-v3 on CPU is ~15× slower than real-time — not viable for live services
|
|
|
|
|
+- GPU target: RTX 4070 Super runs large-v3 comfortably in real-time
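
A quick way to confirm which device PyTorch will use before a service is sketched below. The helper name `describe_torch_device` is hypothetical; it relies only on `torch.cuda.is_available()` and `torch.cuda.get_device_name()`, and it degrades gracefully when PyTorch is missing or its DLLs fail to load.

```python
def describe_torch_device() -> str:
    """Report whether PyTorch can see a CUDA GPU on this machine."""
    try:
        import torch
    except (ImportError, OSError):
        # OSError covers a missing CUDA DLL at import time on Windows.
        return "torch unavailable"
    if torch.cuda.is_available():
        return f"cuda: {torch.cuda.get_device_name(0)}"
    return "cpu only"


print(describe_torch_device())
```

A result of `cpu only` on the production PC would suggest that either the CPU-only wheel was installed or the CUDA Toolkit is missing.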
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
-## ESP32 Environment
|
|
|
|
|
|
|
+## Display Environment
|
|
|
|
|
|
|
|
-- Board: ESP32-S3 (PSRAM required for large font glyph buffers)
|
|
|
|
|
-- Framework: Arduino via PlatformIO
|
|
|
|
|
-- Display: Waveshare 7.5" V2 (800×480 pixels, black/white)
|
|
|
|
|
-- Display library: GxEPD2
|
|
|
|
|
-- MQTT library: PubSubClient (increase buffer: `client.setBufferSize(512)`)
|
|
|
|
|
-- Build tool: PlatformIO (VSCode)
|
|
|
|
|
|
|
+The display is a fullscreen browser page served by `admin.py` at `/display`.
|
|
|
|
|
|
|
|
-### SPI Wiring (Waveshare 7.5" V2 → ESP32)
|
|
|
|
|
|
|
+- **URL**: `http://[PC-IP]:8001/display` — open on any tablet, TV, or spare device on the same WiFi
|
|
|
|
|
+- **Push mechanism**: Server-Sent Events (SSE) from `admin.py`, which subscribes to MQTT and forwards `display/text` payloads to connected browsers
|
|
|
|
|
+- **Layout**: Speaker name header at top, 3 rolling lines of transcription text below; font scales to screen size
|
|
|
|
|
+- **Full-screen**: press F11 in browser (or use guided kiosk mode on tablet)
|
|
|
|
|
+- **Multiple simultaneous displays**: each browser is an independent SSE subscriber
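
The fan-out described above, one MQTT payload forwarded to every connected browser, could look roughly like the following. This is a sketch under stated assumptions: `DisplayHub` and `format_sse` are hypothetical names, each browser connection is modelled as one `asyncio.Queue`, and the queue size of 16 is arbitrary. The frame format (`data: ...` followed by a blank line) is the standard SSE wire format.

```python
import asyncio
import json


def format_sse(payload: dict) -> str:
    """Encode one payload as a Server-Sent Events frame."""
    return f"data: {json.dumps(payload)}\n\n"


class DisplayHub:
    """Fans each payload out to every connected display (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers = set()

    def subscribe(self) -> asyncio.Queue:
        """Register one browser connection and return its private queue."""
        queue = asyncio.Queue(maxsize=16)
        self._subscribers.add(queue)
        return queue

    def unsubscribe(self, queue: asyncio.Queue) -> None:
        """Forget a connection, for example when the browser disconnects."""
        self._subscribers.discard(queue)

    def publish(self, payload: dict) -> None:
        """Queue the same frame for every subscriber."""
        frame = format_sse(payload)
        for queue in self._subscribers:
            if queue.full():
                queue.get_nowait()  # drop the oldest frame for a slow display
            queue.put_nowait(frame)
```

An SSE endpoint would call `subscribe()` once per request and yield frames from its queue until the client disconnects. If the MQTT callback runs on a separate thread, `publish` would need to be scheduled onto the event loop with `loop.call_soon_threadsafe`.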
|
|
|
|
|
|
|
|
-| Display Pin | ESP32 Pin |
|
|
|
|
|
-|---|---|
|
|
|
|
|
-| BUSY | GPIO 4 |
|
|
|
|
|
-| RST | GPIO 16 |
|
|
|
|
|
-| DC | GPIO 17 |
|
|
|
|
|
-| CS | GPIO 5 |
|
|
|
|
|
-| CLK | GPIO 18 |
|
|
|
|
|
-| DIN | GPIO 23 |
|
|
|
|
|
-| GND | GND |
|
|
|
|
|
-| VCC | 3.3V |
|
|
|
|
|
|
|
+No microcontroller, firmware, or hardware assembly is required.
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
## MQTT Topics
|
|
|
|
|
|
|
|
| Topic | Direction | Payload |
|
|
|
-|---|---|---|
|
|
|
|
|
-| `display/text` | PC → ESP32 | JSON: see payload schema below |
|
|
|
|
|
-| `display/clear` | PC → ESP32 | Empty / any value |
|
|
|
|
|
-| `display/status` | ESP32 → PC | JSON: `{"ready": true}` |
|
|
|
|
|
|
|
+| --- | --- | --- |
|
|
|
|
|
+| `display/text` | bridge → admin.py → display browsers | JSON: see schema below |
|
|
|
|
|
+| `display/clear` | bridge → admin.py → display browsers | Empty |
|
|
|
|
|
|
|
|
### display/text Payload Schema
|
|
|
|
|
|
|
|
```json
{
-  "speaker": "PASTOR JOHN",
-  "speaker_changed": true,
   "lines": [
+    "PASTOR JOHN",
     "...and He said unto them, go",
     "into all the world and preach"
   ]
}
```
|
|
|
|
|
|
|
|
-- `speaker`: resolved name string, or `null` if unknown/unnamed
|
|
|
|
|
-- `speaker_changed`: `true` triggers full display refresh + speaker header redraw
|
|
|
|
|
-- `lines`: array of pre-wrapped strings, max 40 chars each, max 3 items
|
|
|
|
|
|
|
+- `lines`: array of strings, max `DISPLAY_LINES` items (currently 3); speaker name injected as first line on speaker change
|
|
|
|
|
+- Bridge pre-wraps text at `MAX_LINE_CHARS` (38) using `textwrap.wrap`
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
## Key Files
|
|
|
|
|
|
|
|
### `bridge/bridge.py`
|
|
|
-Main orchestrator. Connects to Whisper WebSocket and Mosquitto. Receives incremental diarized transcription. Buffers text. Resolves speaker names. Publishes MQTT payloads.
|
|
|
|
|
|
|
|
|
|
-**WebSocket message fields from WhisperLiveKit (with diarization):**
|
|
|
|
|
-```json
|
|
|
|
|
-{
|
|
|
|
|
- "text": "and He said unto them",
|
|
|
|
|
- "speaker": "SPEAKER_0",
|
|
|
|
|
- "is_final": true,
|
|
|
|
|
- "start": 12.4,
|
|
|
|
|
- "end": 15.1
|
|
|
|
|
-}
|
|
|
|
|
-```
|
|
|
|
|
|
|
+Main audio pipeline. Headless — no UI. Connects to Whisper WebSocket and Mosquitto.
|
|
|
|
|
|
|
|
-**Bridge logic:**
|
|
|
|
|
-1. On each `is_final` segment, extract `text` and `speaker`
|
|
|
|
|
-2. Resolve `speaker` → name via `speaker_registry`
|
|
|
|
|
-3. If speaker is unknown, notify `admin_ui` (via queue or callback)
|
|
|
|
|
-4. Accumulate text into rolling buffer
|
|
|
|
|
-5. On sentence boundary or 4s timeout, word-wrap and publish to MQTT
|
|
|
|
|
-6. Set `speaker_changed: true` if speaker differs from last published segment
|
|
|
|
|
-
|
|
|
|
|
-### `bridge/speaker_registry.py`
|
|
|
|
|
-Manages the session-persistent mapping of `SPEAKER_N` IDs to real names.
|
|
|
|
|
-
|
|
|
|
|
-```python
|
|
|
|
|
-# Core interface
|
|
|
|
|
-registry = SpeakerRegistry()
|
|
|
|
|
-registry.assign(speaker_id="SPEAKER_0", name="Pastor John")
|
|
|
|
|
-name = registry.resolve("SPEAKER_0") # Returns "Pastor John" or None
|
|
|
|
|
-registry.is_known("SPEAKER_1") # Returns False
|
|
|
|
|
-registry.save_session() # Persist to JSON for the session
|
|
|
|
|
-```
|
|
|
|
|
|
|
+**Current state:**
|
|
|
|
|
|
|
|
-- Session data stored in `bridge/sessions/YYYY-MM-DD.json`
|
|
|
|
|
-- v2: will also store voice embeddings per speaker for cross-session recognition
|
|
|
|
|
|
|
+- `BridgeState` class holds all mutable state (thread-safe via `threading.Lock`)
|
|
|
|
|
+- `speaker_names`: dict loaded from `speakers.json`, polled for changes every 5s via `_speaker_reloader()`
|
|
|
|
|
+- `push_final()`: accumulates text, detects speaker change, flushes on sentence boundary or timeout
|
|
|
|
|
+- `_flush()`: word-wraps with `textwrap.wrap(text, 38)`, maintains 3-line rolling display, injects `[SPEAKER NAME]` label on speaker change, publishes to MQTT
|
|
|
|
|
+- `_choose_audio_device()`: lists input devices, respects `AUDIO_DEVICE` config constant
|
|
|
|
|
+- Audio path: `sounddevice.InputStream` → asyncio queue → WebSocket chunks to WhisperLiveKit
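
The wrap-and-roll behaviour attributed to `_flush()` above can be sketched with the standard library alone. This is an illustration rather than the code in `bridge.py`: the class name `RollingDisplay` is hypothetical, and it assumes the speaker label is upper-cased and counts toward the `DISPLAY_LINES` limit, neither of which the notes confirm.

```python
import json
import textwrap
from collections import deque
from typing import Optional

MAX_LINE_CHARS = 38
DISPLAY_LINES = 3


class RollingDisplay:
    """Keeps the last DISPLAY_LINES wrapped lines (hypothetical helper)."""

    def __init__(self) -> None:
        self._lines = deque(maxlen=DISPLAY_LINES)
        self._last_speaker: Optional[str] = None

    def flush(self, text: str, speaker: Optional[str]) -> str:
        """Wrap `text`, roll the window, and return the JSON payload."""
        if speaker and speaker != self._last_speaker:
            # Assumption: the label is injected as an ordinary line.
            self._lines.append(speaker.upper())
            self._last_speaker = speaker
        for line in textwrap.wrap(text, MAX_LINE_CHARS):
            self._lines.append(line)  # oldest line falls off automatically
        return json.dumps({"lines": list(self._lines)})
```

The returned string is what would be published to `display/text`. Using `deque(maxlen=...)` keeps the rolling window bounded without explicit slicing.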
|
|
|
|
|
|
|
|
-### `bridge/admin_ui.py`
|
|
|
|
|
-Lightweight Tkinter window. Runs in a separate thread alongside bridge.py.
|
|
|
|
|
|
|
+**Config constants** (top of file):
|
|
|
|
|
|
|
|
-**Behaviour:**
|
|
|
|
|
-- Displays current speaker label and resolved name (or "Unknown")
|
|
|
|
|
-- When a new unknown `SPEAKER_N` is detected, shows a prompt: "New speaker detected. Who is this?"
|
|
|
|
|
-- Operator types name and hits Enter
|
|
|
|
|
-- Calls `registry.assign()` and the display updates immediately
|
|
|
|
|
-- Also shows a manual override: operator can retype any name at any time
|
|
|
|
|
|
|
+- `MQTT_HOST`, `WS_URL`, `SAMPLE_RATE=16000`, `BLOCKSIZE=4096`
|
|
|
|
|
+- `SENTENCE_TIMEOUT=4.0`, `MAX_LINE_CHARS=38`, `DISPLAY_LINES=3`
|
|
|
|
|
+- `AUDIO_DEVICE=None` — set to an integer index to force a specific microphone
|
|
|
|
|
|
|
|
-### `esp32/src/main.cpp`
|
|
|
|
|
-ESP32 firmware. WiFi + MQTT client. Receives JSON payloads and renders to e-ink.
|
|
|
|
|
|
|
+### `bridge/admin.py`
|
|
|
|
|
|
|
|
-**Display rendering logic:**
|
|
|
|
|
-- On `speaker_changed: true`: full refresh, print speaker name in large CAPS on line 1, then print text lines below
|
|
|
|
|
-- On `speaker_changed: false`: partial refresh, overwrite text lines only (speaker header stays)
|
|
|
|
|
-- Track partial refresh count; force full refresh every 10 cycles to clear ghosting
|
|
|
|
|
-- Font: large enough for ~40 chars across 800px (approx FreeSans 18–24pt at this resolution)
|
|
|
|
|
|
|
+FastAPI web server on port 8001. Single-file — HTML/CSS/JS embedded as a Python string.
|
|
|
|
|
|
|
|
----
|
|
|
|
|
|
|
+**Endpoints:**
|
|
|
|
|
+
|
|
|
|
|
+- `GET /` — speaker management web UI
|
|
|
|
|
+- `GET|POST /api/speakers` — list / add speakers
|
|
|
|
|
+- `PUT|DELETE /api/speakers/{sid}` — rename / remove speaker
|
|
|
|
|
+- `POST|GET /api/speakers/{sid}/recording` — upload / serve per-speaker voice sample
|
|
|
|
|
+- `POST /api/test/upload` — upload full-service test recording
|
|
|
|
|
+- `GET /api/test/files` — list test recordings
|
|
|
|
|
+- `DELETE /api/test/files/{filename}` — delete test recording
|
|
|
|
|
+- `POST /api/test/start` — stream test recording to WhisperLiveKit (via `_stream_file()`)
|
|
|
|
|
+- `POST /api/test/stop` — cancel active playback
|
|
|
|
|
+- `GET /api/test/status` — playback progress / state
|
|
|
|
|
+- `GET /display` — fullscreen display page *(not yet implemented)*
|
|
|
|
|
+- `GET /api/display/stream` — SSE endpoint for display page *(not yet implemented)*
|
|
|
|
|
+
|
|
|
|
|
+**Test playback**: `_stream_file()` is an asyncio task that reads audio via `miniaudio.stream_file()` (handles WAV/MP3/FLAC/OGG/M4A, resamples to 16kHz mono) and streams chunks to `ws://localhost:8000/asr`, mimicking live microphone input.
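
To mimic a live microphone, playback has to pace its chunks rather than send the whole file at once. The sketch below shows the timing arithmetic only; `chunk_interval` and `stream_paced` are hypothetical names, the `send` callable stands in for the WebSocket send, and it assumes each chunk holds `BLOCKSIZE` samples at `SAMPLE_RATE`.

```python
import asyncio

SAMPLE_RATE = 16000
BLOCKSIZE = 4096


def chunk_interval(blocksize: int, sample_rate: int, speed: float) -> float:
    """Seconds to wait between chunks at the given playback speed."""
    if speed <= 0:
        raise ValueError("speed must be positive")
    return blocksize / sample_rate / speed


async def stream_paced(chunks, send, speed: float = 1.0) -> int:
    """Send each chunk through `send`, sleeping between chunks."""
    interval = chunk_interval(BLOCKSIZE, SAMPLE_RATE, speed)
    sent = 0
    for chunk in chunks:
        await send(chunk)
        await asyncio.sleep(interval)
        sent += 1
    return sent
```

With these constants one chunk covers 0.256 s of audio, so 4× playback sends a chunk every 0.064 s.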
|
|
|
|
|
+
|
|
|
|
|
+### `bridge/whisper_launcher.py`
|
|
|
|
|
+
|
|
|
|
|
+Startup wrapper for WhisperLiveKit. Applies ffmpeg PATH fix and torchaudio shim, then calls `whisperlivekit.cli.main()`. Used by `start.bat` instead of `wlk` directly.
|
|
|
|
|
|
|
|
-## Display Layout (800×480 pixels)
|
|
|
|
|
|
|
+### `bridge/speakers.json`
|
|
|
|
|
|
|
|
|
|
+Auto-created on first run. Format: `{"SPEAKER_00": "Pastor", "SPEAKER_01": "Reader", ...}`. Seeded with 4 defaults. Persists across sessions. Written by both `bridge.py` and `admin.py`; `bridge.py` polls mtime every 5s to pick up admin changes.
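
The mtime polling mentioned above could be implemented along these lines. This is a sketch, not the `_speaker_reloader()` in `bridge.py`: the class name `SpeakerFile` is hypothetical, and skipping a file that fails to parse is an assumption about how a half-written file should be handled when two processes write it.

```python
import json
import os


class SpeakerFile:
    """Reloads a JSON speaker map only when the file changes (hypothetical)."""

    def __init__(self, path: str) -> None:
        self.path = path
        self.names = {}
        self._mtime = None

    def reload_if_changed(self) -> bool:
        """Return True if the file was re-read since the last call."""
        try:
            mtime = os.stat(self.path).st_mtime_ns
        except FileNotFoundError:
            return False
        if mtime == self._mtime:
            return False
        try:
            with open(self.path, encoding="utf-8") as handle:
                names = json.load(handle)
        except json.JSONDecodeError:
            return False  # file caught mid-write; retry on the next poll
        self.names = names
        self._mtime = mtime
        return True
```

A reloader loop would call `reload_if_changed()` every 5 seconds and swap the new dict into shared state under the existing lock.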
|
|
|
|
|
+
|
|
|
|
|
+### `bridge/requirements.txt`
|
|
|
|
|
+
|
|
|
|
|
+```text
|
|
|
|
|
+paho-mqtt>=2.0
|
|
|
|
|
+websockets>=12.0
|
|
|
|
|
+sounddevice>=0.4.6
|
|
|
|
|
+numpy>=1.24
|
|
|
|
|
+fastapi>=0.111
|
|
|
|
|
+uvicorn>=0.29
|
|
|
|
|
+python-multipart>=0.0.9
|
|
|
|
|
+miniaudio>=1.59
|
|
|
|
|
+imageio-ffmpeg>=2.9
|
|
|
```
|
|
```
|
|
|
-┌────────────────────────────────────────────────┐ ← full width
|
|
|
|
|
-│ PASTOR JOHN │ ← speaker name, top ~80px, bold/large
|
|
|
|
|
-│────────────────────────────────────────────────│
|
|
|
|
|
-│ ...and He said unto them, go into all the │ ← text line 1
|
|
|
|
|
-│ world and preach the gospel to every │ ← text line 2
|
|
|
|
|
-│ creature. He that believeth and is baptised │ ← text line 3
|
|
|
|
|
-└────────────────────────────────────────────────┘
|
|
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Display Layout
|
|
|
|
|
+
|
|
|
|
|
+Browser-based, scales to any screen size.
|
|
|
|
|
+
|
|
|
|
|
+```text
|
|
|
|
|
+┌─────────────────────────────────────────────────┐
|
|
|
|
|
+│ PASTOR JOHN │ ← speaker name, prominent, top section
|
|
|
|
|
+│─────────────────────────────────────────────────│
|
|
|
|
|
+│ ...and He said unto them, go into all the │ ← line 1
|
|
|
|
|
+│ world and preach the gospel to every │ ← line 2
|
|
|
|
|
+│ creature. He that believeth and is baptised │ ← line 3
|
|
|
|
|
+└─────────────────────────────────────────────────┘
|
|
|
```
|
|
```
|
|
|
|
|
|
|
|
-- Speaker name zone: top ~80px
|
|
|
|
|
-- Text zone: remaining ~380px, 3 lines at ~120px each
|
|
|
|
|
-- On speaker change: full clear, redraw both zones
|
|
|
|
|
-- On same speaker new text: partial refresh text zone only
|
|
|
|
|
|
|
+- Font size scales with viewport width — target readability at 3–5 metres
|
|
|
|
|
+- High contrast: white text on dark background recommended for bright church environments
|
|
|
|
|
+- Speaker name shown only when speaker changes (not repeated per line)
|
|
|
|
|
+- Instant update via SSE — no refresh flash
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
## Speaker Diarization Notes
|
|
|
|
|
|
|
|
-### v1 — Operator-Assisted Naming
|
|
|
|
|
-- Zero prep before service
|
|
|
|
|
-- admin_ui.py shows prompt when new `SPEAKER_N` appears
|
|
|
|
|
-- Operator at sound desk types name (e.g. "Pastor John") once
|
|
|
|
|
-- Registry holds the mapping for the entire session
|
|
|
|
|
|
|
+### Active approach — diart (pyannote.audio)
|
|
|
|
|
|
|
|
-### v2 — Voice Enrolment (future)
|
|
|
|
|
-- Record 10–30s of each speaker saying natural speech (not word lists)
|
|
|
|
|
-- Extract speaker embedding using pyannote `SpeakerEmbedding` pipeline
|
|
|
|
|
-- Store embedding in `bridge/profiles/<name>.npy`
|
|
|
|
|
-- At runtime, compare incoming `SPEAKER_N` embedding to stored profiles
|
|
|
|
|
-- If cosine similarity > threshold (~0.85), auto-assign name
|
|
|
|
|
-- Fall back to operator prompt if no match above threshold
|
|
|
|
|
|
|
+- Launched via `--diarization-backend diart` in `whisper_launcher.py`
|
|
|
|
|
+- torchaudio compatibility shim applied in launcher (`set_audio_backend` was removed in torchaudio 2.x)
|
|
|
|
|
+- Tracks 2–4+ speakers reliably in clean audio conditions
|
|
|
|
|
+- Works best with direct mixer feed; background music may confuse diarization
|
|
|
|
|
+- Congregation responses ("Amen", "Hallelujah") appear as brief unknown speakers — minimum-duration filter (~2s) before triggering admin alert is a future improvement
|
|
|
|
|
|
|
|
-### Known Diarization Constraints
|
|
|
|
|
-- Streaming Sortformer tracks 2–4+ speakers reliably
|
|
|
|
|
-- Works best with clean, low-noise audio — direct mixer feed strongly preferred
|
|
|
|
|
-- Background music (worship) may confuse diarization; consider muting music channel on the transcription input
|
|
|
|
|
-- Congregation responses ("Amen", "Hallelujah") may appear as brief unknown speakers — consider a minimum-duration filter (~2s) before triggering a speaker prompt
|
|
|
|
|
|
|
+### v1 — Operator-Assisted Naming (current)
|
|
|
|
|
+
|
|
|
|
|
+- New `SPEAKER_XX` IDs appear automatically in the admin web table within 5s (via speakers.json polling)
|
|
|
|
|
+- Operator types the name in the table row and saves — takes effect immediately
|
|
|
|
|
+- Speaker names persist across sessions in speakers.json
|
|
|
|
|
+
|
|
|
|
|
+### v2 — Voice Enrolment (planned)
|
|
|
|
|
+
|
|
|
|
|
+- Upload 10–30s clear speech sample per speaker via admin page Voice Sample column (already implemented)
|
|
|
|
|
+- Extract embedding using pyannote `SpeakerEmbedding` pipeline
|
|
|
|
|
+- At runtime, compare incoming `SPEAKER_N` embedding to stored profiles
|
|
|
|
|
+- Auto-assign name if cosine similarity > threshold (~0.85); fall back to operator prompt otherwise
|
|
|
|
|
+- Embeddings stored in `bridge/profiles/<name>.npy`
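
The matching step planned above reduces to a nearest-profile lookup with a cut-off. The sketch below uses plain Python lists so that it runs without NumPy; `cosine_similarity` and `match_speaker` are hypothetical names, and the two-dimensional vectors in the usage lines are toy values, not real pyannote embeddings.

```python
import math
from typing import Dict, Optional, Sequence

MATCH_THRESHOLD = 0.85  # the approximate threshold named in the notes


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(
    embedding: Sequence[float],
    profiles: Dict[str, Sequence[float]],
    threshold: float = MATCH_THRESHOLD,
) -> Optional[str]:
    """Return the best-matching name above the threshold, else None."""
    best_name, best_score = None, threshold
    for name, profile in profiles.items():
        score = cosine_similarity(embedding, profile)
        if score > best_score:
            best_name, best_score = name, score
    return best_name


profiles = {"Pastor John": [1.0, 0.0], "Reader": [0.0, 1.0]}
print(match_speaker([0.9, 0.1], profiles))
print(match_speaker([1.0, 1.0], profiles))
```

A `None` result corresponds to the fallback case, where the operator names the speaker in the admin table.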
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
## Design Constraints & Open Questions
|
|
|
|
|
|
|
|
-- [ ] Streaming Sortformer stability in WhisperLiveKit — test early; fall back to Diart if needed
|
|
|
|
|
-- [ ] Minimum speaker segment duration before triggering name prompt (avoid congregation one-liners)
|
|
|
|
|
-- [ ] Partial refresh ghosting — determine optimal full-refresh interval for the chosen display
|
|
|
|
|
-- [ ] ESP32-S3 PSRAM: confirm font glyph buffer fits; WROOM (no PSRAM) likely insufficient for large fonts
|
|
|
|
|
-- [ ] Word-wrap edge cases: long proper nouns, scripture references, place names
|
|
|
|
|
-- [ ] Session save/restore: if PC crashes mid-service, can operator reload speaker assignments quickly?
|
|
|
|
|
|
|
+- [ ] Display page `/display` not yet built — next major task
|
|
|
|
|
+- [ ] SSE push from admin.py to display browsers — requires admin.py to subscribe to MQTT or receive updates from bridge.py via shared state
|
|
|
|
|
+- [ ] Minimum speaker segment duration before adding to admin table (avoid congregation one-liners populating 50 rows)
|
|
|
|
|
+- [ ] Voice enrolment v2 — pyannote.audio is installed, extraction pipeline not yet written
|
|
|
|
|
+- [ ] Word-wrap edge cases: long proper nouns, scripture references
|
|
|
|
|
+- [ ] Session save/restore: `speakers.json` persists on disk, so speaker names reload immediately on restart if the PC crashes mid-service
|
|
|
- [ ] Audio routing on Windows: ensure Whisper receives the mixer/mic channel, not system audio
|
|
|
|
|
+- [ ] CUDA Toolkit 12.x installation required for GPU acceleration (cublas64_12.dll)
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
## Testing Approach
|
|
|
|
|
|
|
|
-1. **Whisper standalone**: speak into mic, verify text output in browser at `http://localhost:8000`
|
|
|
|
|
-2. **Diarization standalone**: two people alternate speaking, verify `SPEAKER_0` / `SPEAKER_1` labels in WS output
|
|
|
|
|
-3. **Registry + bridge**: run bridge.py, verify name prompts appear in admin_ui.py, verify MQTT payloads via `mosquitto_sub -t display/#`
|
|
|
|
|
-4. **ESP32 display**: send static MQTT messages manually before connecting bridge
|
|
|
|
|
-5. **End-to-end**: full pipeline test with recorded sermon audio (mix of 2–3 speakers)
|
|
|
|
|
-6. **In-situ trial**: 1–2 Sunday services with a volunteer congregant providing feedback
|
|
|
|
|
|
|
+1. **Whisper standalone**: speak into mic, verify text in browser at `http://localhost:8000`
|
|
|
|
|
+2. **Diarization**: two people alternate speaking, verify `SPEAKER_00` / `SPEAKER_01` labels in WS output
|
|
|
|
|
+3. **Bridge**: run `bridge.py`, verify MQTT payloads via `mosquitto_sub -t display/#`
|
|
|
|
|
+4. **Admin**: open `http://localhost:8001`, verify speaker rows appear, rename one, confirm bridge picks up the change within 5s
|
|
|
|
|
+5. **Test playback**: upload a full service recording via admin, press Play at 4×, verify transcription appears in MQTT and display
|
|
|
|
|
+6. **Display page**: open `http://[PC-IP]:8001/display` on tablet, verify text updates in real time
|
|
|
|
|
+7. **In-situ trial**: 1–2 Sunday services with a volunteer congregant providing feedback
|
|
|
|
|
+
|
|
|
|
|
+---
|
|
|
|
|
+
|
|
|
|
|
+## Development Sequence (Remaining)
|
|
|
|
|
+
|
|
|
|
|
+1. Build `/display` fullscreen browser page in `admin.py`
|
|
|
|
|
+2. Add SSE endpoint (`/api/display/stream`) in `admin.py` — subscribe to MQTT, push payloads to browsers
|
|
|
|
|
+3. Style display page: large font, dark background, speaker header, 3-line rolling text
|
|
|
|
|
+4. Install CUDA Toolkit 12.x on the production PC to enable GPU acceleration
|
|
|
|
|
+5. Voice enrolment v2 — extract pyannote embeddings from uploaded samples, add matching logic to bridge
|
|
|
|
|
+6. Church deployment trial
|
|
|
|
|
|
|
|
---
|
|
|
|
|
|
|
|
-## Development Sequence (Suggested)
|
|
|
|
|
|
|
+## Further Enhancements
|
|
|
|
|
|
|
|
-1. Get WhisperLiveKit running with `--diarization` flag, confirm WS output includes speaker labels
|
|
|
|
|
-2. Write `bridge.py` (transcription only, no diarization yet) → verify MQTT publish works
|
|
|
|
|
-3. Add `speaker_registry.py` and `admin_ui.py` → test name mapping loop
|
|
|
|
|
-4. Integrate diarization into bridge — handle `speaker_changed` logic
|
|
|
|
|
-5. Write ESP32 firmware — basic text display
|
|
|
|
|
-6. Add speaker header zone and refresh logic to ESP32 firmware
|
|
|
|
|
-7. Full end-to-end test on bench
|
|
|
|
|
-8. Church trial
|
|
|
|
|
|
|
+- Convert speakers.json to remote database for multi-event / multi-location usage
|
|
|
|
|
+- Transcription log: table of speaker name + first sentence of each turn, exportable after the service
|
|
|
|
|
+- Minimum-duration filter: suppress `SPEAKER_XX` rows for segments under ~2s (congregation responses)
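
One possible shape for the minimum-duration filter is sketched below. It assumes each diarized segment carries `start` and `end` times in seconds, and that a speaker is admitted permanently once any single segment reaches the threshold. The class name `SpeakerGate` is hypothetical.

```python
MIN_SPEAKER_SECONDS = 2.0


class SpeakerGate:
    """Admits a speaker ID only after one sufficiently long segment."""

    def __init__(self, min_seconds: float = MIN_SPEAKER_SECONDS) -> None:
        self.min_seconds = min_seconds
        self._admitted = set()

    def observe(self, speaker_id: str, start: float, end: float) -> bool:
        """Record a segment; return True if the speaker should be listed."""
        if speaker_id not in self._admitted and (end - start) >= self.min_seconds:
            self._admitted.add(speaker_id)
        return speaker_id in self._admitted
```

Under this rule a congregation member who only ever says "Amen" never gets a row, while a reader who speaks for several seconds is listed from that segment onward.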
|