# Church Live Transcription Display

A live speech-to-text system for deaf and hard-of-hearing congregants, displaying real-time transcriptions with speaker identification on an e-ink screen driven by an ESP32 microcontroller.

## Overview

Audio from the church service is captured on a Windows PC and transcribed and speaker-diarized locally using WhisperLiveKit. The resulting speaker-tagged text is pushed via MQTT over WiFi to an ESP32 that drives a large e-ink display. The display shows who is speaking alongside what they are saying, updates in real time, and requires no internet connection.

```
[Microphone / Mixer]
        ↓
[Windows PC]
  ├── WhisperLiveKit  (transcription + speaker diarization)
  ├── Mosquitto MQTT broker
  ├── bridge.py       (WebSocket → name mapping → MQTT)
  └── Speaker Admin UI (operator names speakers live)
        ↓ MQTT / WiFi
[ESP32 + e-ink display]
```

## Goals

- Real-time captions with minimal latency (target: < 3 seconds end-to-end)
- Speaker identification — display who is speaking when the speaker changes
- Named speakers — operator maps anonymous speaker IDs to real names during service
- Future: voice enrolment so names are matched automatically from pre-recorded samples
- Runs entirely on local network — no cloud dependency
- Readable at distance with large font (36–48pt equivalent)
- Low cost, low complexity hardware

---

## Speaker Identification

### How It Works

WhisperLiveKit includes **Streaming Sortformer**, a real-time speaker diarization model developed by NVIDIA (state of the art as of 2025). It runs alongside Whisper transcription and tags each segment of speech with an anonymous speaker label (`SPEAKER_0`, `SPEAKER_1`, etc.).

A name mapping layer in the bridge script translates these anonymous labels into real names, which are then included in the MQTT payload sent to the display.
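
The mapping step could look something like the sketch below. It is an illustration only: the `SPEAKER_NAMES` table, the `build_payload` helper, and the JSON field names `speaker` and `text` are assumptions, not a format defined by this project or by WhisperLiveKit.

```python
import json

# Hypothetical session mapping from diarization labels to display names.
SPEAKER_NAMES = {"SPEAKER_0": "Pastor John", "SPEAKER_1": "Mary (Reader)"}


def build_payload(label: str, text: str, names: dict) -> str:
    """Return a JSON payload with the label resolved to a display name.

    Labels without a mapping fall back to the raw label, so a caption
    is still shown before the operator has named the speaker.
    """
    name = names.get(label, label)
    return json.dumps({"speaker": name, "text": text})
```

Falling back to the raw label keeps captions flowing for speakers who have not been named yet.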

### Display Format

When the speaker changes, the new speaker's name is shown as a header line above their words. The name is not repeated on subsequent lines from the same speaker.

```
┌─────────────────────────────────┐
│ PASTOR JOHN                     │
│ ...and He said unto them, go    │
│ into all the world and preach   │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│ MARY (READER)                   │
│ A reading from Luke chapter 4,  │
│ verse 18...                     │
└─────────────────────────────────┘
```
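
The header rule above can be modelled in a few lines. This is a minimal sketch of the logic only; `render_lines` and the `(speaker, text)` tuple shape are hypothetical, and the real rendering would happen in the ESP32 firmware.

```python
def render_lines(segments):
    """Turn (speaker, text) pairs into display lines.

    The speaker name is emitted in capitals only when it differs
    from the speaker of the previous segment.
    """
    lines = []
    previous = None
    for speaker, text in segments:
        if speaker != previous:
            lines.append(speaker.upper())
            previous = speaker
        lines.append(text)
    return lines
```

For example, two consecutive segments from the same speaker produce a single header followed by two text lines.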

### Speaker Naming — Two Approaches

**v1 — Operator-Assisted Naming (implemented first)**

A simple admin UI runs on the PC alongside the bridge script. When a new unknown speaker is detected, the operator sees a prompt ("New speaker detected — who is this?") and types the name once. That name is stored for the session and used every time that speaker is detected again.

- No setup required before the service
- Works from the very first Sunday
- Operator (e.g. sound desk volunteer) assigns names as speakers appear
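
One possible shape for the session-scoped registry behind this workflow is sketched below. The class and method names are hypothetical and are not the actual contents of `speaker_registry.py`; the admin UI is assumed to poll `pending()` and call `assign()` when the operator types a name.

```python
class SessionRegistry:
    """Tracks label-to-name assignments for one service."""

    def __init__(self):
        self._names = {}    # label -> operator-assigned name
        self._pending = []  # labels seen but not yet named

    def resolve(self, label):
        """Return the assigned name, or the raw label if none exists yet."""
        if label in self._names:
            return self._names[label]
        if label not in self._pending:
            self._pending.append(label)  # queue a prompt for the operator
        return label

    def pending(self):
        """Labels the operator still needs to name, in order of appearance."""
        return list(self._pending)

    def assign(self, label, name):
        """Store the operator's answer and clear the prompt for that label."""
        self._names[label] = name
        if label in self._pending:
            self._pending.remove(label)
```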

**v2 — Voice Enrolment (future upgrade)**

Before the service, a short voice sample (10–30 seconds) is recorded for each expected speaker. The bridge script compares incoming speaker embeddings against enrolled voices and automatically assigns the correct name without operator input.

- No operator intervention during the service
- More accurate for recurring speakers (pastor, regular readers)
- Enrolled voice profiles persist week to week
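
Matching against enrolled voices is commonly done with cosine similarity between embedding vectors. The sketch below assumes the diarization stage can expose a fixed-length embedding per speaker, which this document does not confirm. The function names and the `0.7` threshold are illustrative and would need tuning on real recordings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(embedding, enrolled, threshold=0.7):
    """Return the enrolled name most similar to the embedding.

    Returns None when no enrolled voice scores above the threshold,
    in which case the operator-assisted flow would take over.
    """
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```

Returning `None` for weak matches lets visiting speakers fall through to operator naming rather than being mislabelled.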

### Typical Church Speakers

| Role | Frequency | Notes |
|---|---|---|
| Pastor / Preacher | Every service | Primary speaker, longest segments |
| Worship leader | Most services | May overlap with congregation response |
| Reader / Scripture | Weekly | Short, distinct segments |
| Visiting speaker | Occasionally | New enrolment or operator naming needed |
| Announcements | Weekly | Often the same person each week |

---

## System Components

### PC Side (Windows)
- **WhisperLiveKit** — local GPU-accelerated transcription + diarization server
- **Mosquitto** — lightweight MQTT broker (same PC, port 1883)
- **bridge.py** — WebSocket subscriber, speaker name mapper, MQTT publisher
- **admin_ui.py** — lightweight operator interface for live speaker naming
- **speaker_registry.py** — manages speaker ID ↔ name mappings and voice enrolment

### ESP32 Side
- **ESP32-S3** — WiFi microcontroller (choose a module with PSRAM, which is required for large font bitmaps)
- **Waveshare e-ink display** — 7.5" V2 (800×480) or larger
- **GxEPD2** — display driver library
- **PubSubClient** — MQTT client library

---

## Hardware

| Component | Model | Notes |
|---|---|---|
| Microcontroller | ESP32-S3 | PSRAM required for large font bitmaps |
| Display | Waveshare 7.5" V2 e-Paper | 800×480, supports partial refresh |
| PC | Windows 10/11 with NVIDIA GPU | RTX series recommended |
| Microphone | USB condenser or direct mixer feed | Mixer feed preferred for clean diarization |

---

## Key Design Decisions

### Text & Speaker Buffering
The bridge script accumulates text until it reaches a sentence boundary or a natural pause (~4 s), then checks whether the speaker has changed. If the speaker is unchanged, only the new text lines are pushed. If the speaker has changed, a full payload including the speaker name header is sent, which triggers a full display refresh.
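
A minimal sketch of this buffering rule follows. `CaptionBuffer`, its payload keys, and the punctuation-based sentence test are assumptions made for illustration. The clock is injectable so the pause logic can be exercised without waiting.

```python
import time

SENTENCE_END = (".", "!", "?")
PAUSE_SECONDS = 4.0  # the "natural pause" from the text above


class CaptionBuffer:
    def __init__(self, pause_seconds=PAUSE_SECONDS, clock=time.monotonic):
        self._pause = pause_seconds
        self._clock = clock
        self._text = ""
        self._last_update = clock()
        self._last_speaker = None

    def add(self, text):
        """Append newly transcribed text and note when it arrived."""
        self._text = f"{self._text} {text}".strip()
        self._last_update = self._clock()

    def flush_if_ready(self, speaker):
        """Return a payload dict when the buffer should be sent, else None."""
        if not self._text:
            return None
        ended = self._text.endswith(SENTENCE_END)
        paused = self._clock() - self._last_update >= self._pause
        if not (ended or paused):
            return None
        changed = speaker != self._last_speaker
        payload = {"text": self._text, "speaker_changed": changed}
        if changed:
            payload["speaker"] = speaker  # header only on speaker change
        self._text = ""
        self._last_speaker = speaker
        return payload
```

The bridge would call `flush_if_ready` on a short timer as well as after each `add`, so that a pause is noticed even when no new text arrives.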

### Display Layout
- **Line 1:** Speaker name in CAPS — printed only on speaker change
- **Lines 2–4:** Rolling transcription text, wrapping at ~40 chars per line
- On speaker change: full screen clear then redraw with new name header
- Font targets readability at 3–5 metres
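
The layout rules can be prototyped on the PC with the standard library's `textwrap` before being ported to firmware. The `layout_screen` helper below is hypothetical; it assumes one header line plus the three most recent wrapped text lines.

```python
import textwrap


def layout_screen(speaker, text, width=40, body_lines=3):
    """Return the lines for one screen: name header plus rolling text.

    Text is wrapped at `width` characters and only the last
    `body_lines` wrapped lines are kept, so older text rolls off.
    """
    wrapped = textwrap.wrap(text, width=width)
    return [speaker.upper()] + wrapped[-body_lines:]
```

Whether wrapping is done in the bridge or on the ESP32 is a design choice. Doing it on the PC keeps the firmware simpler, but ties the payload to one display width.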

### E-ink Refresh Strategy
- Speaker change → **full refresh** (~1.5s flash — clean slate, acceptable at speaker transition)
- Same speaker, new text → **partial refresh** (~300ms, minor ghosting)
- Force full refresh every 10 partial refreshes to clear accumulated ghosting
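
The refresh decision reduces to a small counter. The sketch below models that logic in Python for clarity; the firmware itself would implement the equivalent in C++ alongside GxEPD2, and the `RefreshPolicy` name is an assumption.

```python
FULL = "full"
PARTIAL = "partial"


class RefreshPolicy:
    """Chooses a refresh type for each display update."""

    def __init__(self, max_partials=10):
        self._max_partials = max_partials
        self._partials = 0  # partial refreshes since the last full one

    def next_refresh(self, speaker_changed):
        """Return FULL on speaker change or after max_partials partials."""
        if speaker_changed or self._partials >= self._max_partials:
            self._partials = 0
            return FULL
        self._partials += 1
        return PARTIAL
```

With the default setting, ten same-speaker updates use partial refresh and the eleventh is forced to a full refresh.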

### Network
- All traffic on local WiFi (church LAN or dedicated hotspot)
- MQTT broker on Windows PC (port 1883)
- Static IP recommended for ESP32 to avoid reconnection delays

---

## Repository Structure

```
/
├── README.md                     — This file
├── CLAUDE.md                     — AI assistant context for development sessions
├── bridge/
│   ├── bridge.py                 — Main bridge: Whisper WS → name map → MQTT
│   ├── speaker_registry.py       — Speaker ID ↔ name mapping and voice enrolment
│   └── admin_ui.py               — Operator UI for live speaker naming (Tkinter)
├── esp32/
│   ├── src/
│   │   └── main.cpp              — ESP32 Arduino firmware
│   └── platformio.ini            — PlatformIO build config
└── docs/
    ├── hardware-wiring.md        — SPI pin connections for Waveshare display
    ├── setup.md                  — Installation and configuration guide
    └── speaker-enrolment.md      — Guide for recording and enrolling voice samples (v2)
```

---

## Reference Projects

- [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit) — real-time Whisper + Streaming Sortformer diarization
- [NVIDIA Streaming Sortformer](https://developer.nvidia.com/blog/identify-speakers-in-meetings-calls-and-voice-apps-in-real-time-with-nvidia-streaming-sortformer/) — the diarization model integrated into WhisperLiveKit
- [pyannote.audio](https://github.com/pyannote/pyannote-audio) — fallback diarization (Diart integration in WhisperLiveKit)
- [denwilliams/mqtt-epaper](https://github.com/denwilliams/mqtt-epaper) — ESP32 e-paper display driven by MQTT
- [cuci90/epaper_mqtt_esp32](https://github.com/cuci90/epaper_mqtt_esp32) — ESP32 Waveshare display MQTT template

---

## Status

🟡 **Planning / Research phase**

- [x] Architecture defined
- [x] Speaker diarization approach selected (WhisperLiveKit + Streaming Sortformer)
- [x] Speaker naming strategy defined (operator-assisted v1, voice enrolment v2)
- [ ] Python bridge script (transcription only)
- [ ] Speaker name mapping layer (`speaker_registry.py`)
- [ ] Operator admin UI (`admin_ui.py`)
- [ ] ESP32 firmware — basic text display
- [ ] ESP32 firmware — speaker header layout + refresh logic
- [ ] Hardware wiring and bench test
- [ ] End-to-end integration test
- [ ] Voice enrolment system (v2)
- [ ] Church deployment trial