# Church Live Transcription Display

A live speech-to-text system for deaf and hard-of-hearing congregants, displaying real-time transcriptions with speaker identification on an e-ink screen driven by an ESP32 microcontroller.

## Overview

Audio from the church service is captured on a Windows PC, transcribed and speaker-diarized locally using WhisperLiveKit, and the resulting speaker-tagged text is pushed over WiFi/MQTT to an ESP32 that drives a large e-ink display. The display shows who is speaking alongside what they are saying, updates in real time, and requires no internet connection.

```
[Microphone / Mixer]
        ↓
[Windows PC]
  ├── WhisperLiveKit (transcription + speaker diarization)
  ├── Mosquitto MQTT broker
  ├── bridge.py (WebSocket → name mapping → MQTT)
  └── Speaker Admin UI (operator names speakers live)
        ↓ MQTT / WiFi
[ESP32 + e-ink display]
```

## Goals

- Real-time captions with minimal latency (target: < 3 seconds end-to-end)
- Speaker identification — display who is speaking when the speaker changes
- Named speakers — operator maps anonymous speaker IDs to real names during service
- Future: voice enrolment so names are matched automatically from pre-recorded samples
- Runs entirely on local network — no cloud dependency
- Readable at distance with large font (36–48pt equivalent)
- Low cost, low complexity hardware

---

## Speaker Identification

### How It Works

WhisperLiveKit includes **Streaming Sortformer** (SOTA 2025), a real-time speaker diarization model developed by NVIDIA. It runs alongside Whisper transcription and tags each segment of speech with an anonymous speaker label (`SPEAKER_0`, `SPEAKER_1`, etc.). A name mapping layer in the bridge script translates these anonymous labels into real names, which are then included in the MQTT payload sent to the display.

### Display Format

When a speaker changes, their name is shown as a header line above their words. The name is not repeated on every line — only when the speaker changes.

```
┌─────────────────────────────────┐
│  PASTOR JOHN                    │
│  ...and He said unto them, go   │
│  into all the world and preach  │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│  MARY (READER)                  │
│  A reading from Luke chapter 4, │
│  verse 18...                    │
└─────────────────────────────────┘
```

### Speaker Naming — Two Approaches

**v1 — Operator-Assisted Naming (implemented first)**

A simple admin UI runs on the PC alongside the bridge script. When a new unknown speaker is detected, the operator sees a prompt ("New speaker detected — who is this?") and types the name once. That name is stored for the session and used every time that speaker is detected again.

- No setup required before the service
- Works from the very first Sunday
- Operator (e.g. sound desk volunteer) assigns names as speakers appear

**v2 — Voice Enrolment (future upgrade)**

Before the service, a short voice sample (10–30 seconds) is recorded for each expected speaker. The bridge script compares incoming speaker embeddings against enrolled voices and automatically assigns the correct name without operator input.

- No operator intervention during the service
- More accurate for recurring speakers (pastor, regular readers)
- Enrolled voice profiles persist week to week

### Typical Church Speakers

| Role | Frequency | Notes |
|---|---|---|
| Pastor / Preacher | Every service | Primary speaker, longest segments |
| Worship leader | Most services | May overlap with congregation response |
| Reader / Scripture | Weekly | Short, distinct segments |
| Visiting speaker | Occasionally | New enrolment or operator naming needed |
| Announcements | Weekly | Often the same person each week |

---

## System Components

### PC Side (Windows)

- **WhisperLiveKit** — local GPU-accelerated transcription + diarization server
- **Mosquitto** — lightweight MQTT broker (same PC, port 1883)
- **bridge.py** — WebSocket subscriber, speaker name mapper, MQTT publisher
- **admin_ui.py** — lightweight operator interface for live speaker naming
- **speaker_registry.py** — manages speaker ID ↔ name mappings and voice enrolment

### ESP32 Side

- **ESP32-S3** — WiFi microcontroller (S3 preferred — PSRAM needed for large font bitmaps)
- **Waveshare e-ink display** — 7.5" V2 (800×480) or larger
- **GxEPD2** — display driver library
- **PubSubClient** — MQTT client library

---

## Hardware

| Component | Model | Notes |
|---|---|---|
| Microcontroller | ESP32-S3 | PSRAM required for large font bitmaps |
| Display | Waveshare 7.5" V2 e-Paper | 800×480, supports partial refresh |
| PC | Windows 10/11 with NVIDIA GPU | RTX series recommended |
| Microphone | USB condenser or direct mixer feed | Mixer feed preferred for clean diarization |

---

## Key Design Decisions

### Text & Speaker Buffering

The bridge script accumulates text until a sentence boundary or natural pause (~4s), then checks whether the speaker has changed. If the speaker is unchanged, only new text lines are pushed. If the speaker has changed, a full new payload is sent including the speaker name header, triggering a full display refresh.

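The buffering rule above could be sketched roughly as follows. This is a minimal illustration, not the project's bridge code (which is still unwritten at this stage): the `CaptionBuffer` class, its `feed` and `tick` methods, the payload keys (`type`, `speaker`, `lines`) and the 40-character wrap width are all assumptions made for the example.

```python
import re
import textwrap

# Assumed values for illustration only.
PAUSE_SECONDS = 4.0   # "natural pause" threshold
WRAP_WIDTH = 40       # assumed characters per display line

# A fragment ends a sentence if it finishes with . ! or ? (optionally quoted).
SENTENCE_END = re.compile(r"[.!?][\"')\]]?\s*$")


class CaptionBuffer:
    """Hypothetical buffer: collects fragments, emits display payloads."""

    def __init__(self, names=None):
        self.names = names or {}        # e.g. {"SPEAKER_0": "Pastor John"}
        self.pending = []               # fragments not yet published
        self.pending_speaker = None     # speaker of the pending fragments
        self.last_sent_speaker = None   # speaker of the last payload sent
        self.last_fragment_time = None

    def feed(self, speaker_id, text, now):
        """Add one transcribed fragment; return payloads ready to publish."""
        text = text.strip()
        if not text:
            return []
        out = []
        # A different speaker has started: publish what the previous one said.
        if self.pending and speaker_id != self.pending_speaker:
            out.append(self._flush())
        self.pending_speaker = speaker_id
        self.pending.append(text)
        self.last_fragment_time = now
        if SENTENCE_END.search(text):
            out.append(self._flush())
        return out

    def tick(self, now):
        """Call periodically; publishes pending text after a pause."""
        if self.pending and now - self.last_fragment_time >= PAUSE_SECONDS:
            return [self._flush()]
        return []

    def _flush(self):
        speaker = self.pending_speaker
        lines = textwrap.wrap(" ".join(self.pending), WRAP_WIDTH)
        self.pending = []
        if speaker != self.last_sent_speaker:
            # Speaker changed: full payload with the name header.
            self.last_sent_speaker = speaker
            name = self.names.get(speaker, speaker).upper()
            return {"type": "full", "speaker": name, "lines": lines}
        # Same speaker: only the new text lines.
        return {"type": "append", "lines": lines}
```

In this sketch each returned dict would be serialised (for example as JSON) and published over MQTT, and the display side would treat `full` as the trigger for a full refresh and `append` as a partial one. Unnamed speakers fall back to their anonymous label until the operator supplies a name.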
### Display Layout

- **Line 1:** Speaker name in CAPS — printed only on speaker change
- **Lines 2–4:** Rolling transcription text, wrapping at ~40 chars per line
- On speaker change: full screen clear then redraw with new name header
- Font targets readability at 3–5 metres

### E-ink Refresh Strategy

- Speaker change → **full refresh** (~1.5s flash — clean slate, acceptable at speaker transition)
- Same speaker, new text → **partial refresh** (~300ms, minor ghosting)
- Force full refresh every 10 partial refreshes to clear accumulated ghosting

### Network

- All traffic on local WiFi (church LAN or dedicated hotspot)
- MQTT broker on Windows PC (port 1883)
- Static IP recommended for ESP32 to avoid reconnection delays

---

## Repository Structure

```
/
├── README.md                   — This file
├── CLAUDE.md                   — AI assistant context for development sessions
├── bridge/
│   ├── bridge.py               — Main bridge: Whisper WS → name map → MQTT
│   ├── speaker_registry.py     — Speaker ID ↔ name mapping and voice enrolment
│   └── admin_ui.py             — Operator UI for live speaker naming (Tkinter)
├── esp32/
│   ├── src/
│   │   └── main.cpp            — ESP32 Arduino firmware
│   └── platformio.ini          — PlatformIO build config
└── docs/
    ├── hardware-wiring.md      — SPI pin connections for Waveshare display
    ├── setup.md                — Installation and configuration guide
    └── speaker-enrolment.md    — Guide for recording and enrolling voice samples (v2)
```

---

## Reference Projects

- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit) — real-time Whisper + Streaming Sortformer diarization
- [NVIDIA Streaming Sortformer](https://developer.nvidia.com/blog/identify-speakers-in-meetings-calls-and-voice-apps-in-real-time-with-nvidia-streaming-sortformer/) — the diarization model integrated into WhisperLiveKit
- [pyannote.audio](https://github.com/pyannote/pyannote-audio) — fallback diarization (Diart integration in WhisperLiveKit)
- [denwilliams/mqtt-epaper](https://github.com/denwilliams/mqtt-epaper) — ESP32 e-paper display driven by MQTT
- [cuci90/epaper_mqtt_esp32](https://github.com/cuci90/epaper_mqtt_esp32) — ESP32 Waveshare display MQTT template

---

## Status

🟡 **Planning / Research phase**

- [x] Architecture defined
- [x] Speaker diarization approach selected (WhisperLiveKit + Streaming Sortformer)
- [x] Speaker naming strategy defined (operator-assisted v1, voice enrolment v2)
- [ ] Python bridge script (transcription only)
- [ ] Speaker name mapping layer (`speaker_registry.py`)
- [ ] Operator admin UI (`admin_ui.py`)
- [ ] ESP32 firmware — basic text display
- [ ] ESP32 firmware — speaker header layout + refresh logic
- [ ] Hardware wiring and bench test
- [ ] End-to-end integration test
- [ ] Voice enrolment system (v2)
- [ ] Church deployment trial