Church Live Transcription Display

A live speech-to-text system for deaf and hard-of-hearing congregants, displaying real-time transcriptions with speaker identification on an e-ink screen driven by an ESP32 microcontroller.

Overview

Audio from the church service is captured on a Windows PC, transcribed and speaker-diarized locally using WhisperLiveKit, and the resulting speaker-tagged text is pushed over WiFi/MQTT to an ESP32 that drives a large e-ink display. The display shows who is speaking alongside what they are saying, updates in real time, and requires no internet connection.

[Microphone / Mixer]
        ↓
[Windows PC]
  ├── WhisperLiveKit  (transcription + speaker diarization)
  ├── Mosquitto MQTT broker
  ├── bridge.py       (WebSocket → name mapping → MQTT)
  └── Speaker Admin UI (operator names speakers live)
        ↓ MQTT / WiFi
[ESP32 + e-ink display]

Goals

Real-time captions with minimal latency (target: < 3 seconds end-to-end)
Speaker identification — display who is speaking when the speaker changes
Named speakers — operator maps anonymous speaker IDs to real names during service
Future: voice enrolment so names are matched automatically from pre-recorded samples
Runs entirely on local network — no cloud dependency
Readable at distance with large font (36–48pt equivalent)
Low cost, low complexity hardware

Speaker Identification

How It Works

WhisperLiveKit includes Streaming Sortformer, a real-time speaker diarization model developed by NVIDIA and described as state of the art as of 2025. It runs alongside Whisper transcription and tags each segment of speech with an anonymous speaker label (SPEAKER_0, SPEAKER_1, etc.).

A name mapping layer in the bridge script translates these anonymous labels into real names, which are then included in the MQTT payload sent to the display.
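
The mapping step described above could look roughly like the following. This is a minimal sketch, not the contents of bridge.py: the function names `resolve_speaker` and `build_payload`, the JSON payload shape, and the `speaker`/`text` keys are all assumptions made for illustration.

```python
import json
from typing import Dict


def resolve_speaker(label: str, names: Dict[str, str]) -> str:
    """Return the real name for a diarization label (hypothetical helper).

    Falls back to a readable placeholder such as "SPEAKER 1" when the
    label has not been named yet.
    """
    if label in names:
        return names[label]
    return label.replace("_", " ")


def build_payload(label: str, text: str, names: Dict[str, str]) -> str:
    """Build the MQTT message body. The JSON layout is an assumption."""
    return json.dumps({"speaker": resolve_speaker(label, names), "text": text})
```

With `names = {"SPEAKER_0": "Pastor John"}`, a segment tagged SPEAKER_0 is published under the name "Pastor John", while an unnamed SPEAKER_1 is shown with the placeholder "SPEAKER 1" until someone names it.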

Display Format

When a speaker changes, their name is shown as a header line above their words. The name is not repeated on every line — only when the speaker changes.

┌─────────────────────────────────┐
│ PASTOR JOHN                     │
│ ...and He said unto them, go    │
│ into all the world and preach   │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│ MARY (READER)                   │
│ A reading from Luke chapter 4,  │
│ verse 18...                     │
└─────────────────────────────────┘

Speaker Naming — Two Approaches

v1 — Operator-Assisted Naming (implemented first)

A simple admin UI runs on the PC alongside the bridge script. When a new unknown speaker is detected, the operator sees a prompt ("New speaker detected — who is this?") and types the name once. That name is stored for the session and used every time that speaker is detected again.

No setup required before the service
Works from the very first Sunday
Operator (e.g. sound desk volunteer) assigns names as speakers appear
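
One way to model the v1 flow is a small in-memory registry that queues unknown labels for the operator and remembers names for the rest of the session. The class below is a hypothetical sketch (the name `SessionRegistry` and its methods are not taken from speaker_registry.py); the admin UI would poll `pending()` to decide when to show the prompt.

```python
from typing import Dict, List, Optional


class SessionRegistry:
    """Session-scoped speaker label to name mapping (illustrative only)."""

    def __init__(self) -> None:
        self._names: Dict[str, str] = {}
        self._pending: List[str] = []  # labels waiting for the operator

    def lookup(self, label: str) -> Optional[str]:
        """Return the assigned name, or None and queue the label once."""
        name = self._names.get(label)
        if name is None and label not in self._pending:
            self._pending.append(label)
        return name

    def pending(self) -> List[str]:
        """Labels the operator still has to name, in order of appearance."""
        return list(self._pending)

    def assign(self, label: str, name: str) -> None:
        """Store the operator's answer and clear the prompt for that label."""
        self._names[label] = name.strip()
        if label in self._pending:
            self._pending.remove(label)
```

Because a label is queued only the first time it is seen, the operator is prompted once per new speaker rather than on every segment.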

v2 — Voice Enrolment (future upgrade)

Before the service, a short voice sample (10–30 seconds) is recorded for each expected speaker. The bridge script compares incoming speaker embeddings against enrolled voices and automatically assigns the correct name without operator input.

No operator intervention during the service
More accurate for recurring speakers (pastor, regular readers)
Enrolled voice profiles persist week to week
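
The comparison step in v2 could be sketched as a nearest-match search over enrolled embeddings using cosine similarity. Everything here is an assumption for illustration: the source does not say how embeddings are obtained or compared, and the function names and the 0.7 threshold are hypothetical and would need tuning against real recordings.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(
    embedding: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.7,  # assumed cut-off, not from the source
) -> Optional[str]:
    """Return the best-matching enrolled name, or None if nothing is close."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```

Returning None when no enrolled voice clears the threshold leaves room to fall back to operator naming, for example for a visiting speaker.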

Typical Church Speakers

Role	Frequency	Notes
Pastor / Preacher	Every service	Primary speaker, longest segments
Worship leader	Most services	May overlap with congregation response
Reader / Scripture	Weekly	Short, distinct segments
Visiting speaker	Occasionally	New enrolment or operator naming needed
Announcements	Weekly	Often the same person each week

System Components

PC Side (Windows)

WhisperLiveKit — local GPU-accelerated transcription + diarization server
Mosquitto — lightweight MQTT broker (same PC, port 1883)
bridge.py — WebSocket subscriber, speaker name mapper, MQTT publisher
admin_ui.py — lightweight operator interface for live speaker naming
speaker_registry.py — manages speaker ID ↔ name mappings and voice enrolment

ESP32 Side

ESP32-S3 — WiFi microcontroller (S3 preferred — PSRAM needed for large font bitmaps)
Waveshare e-ink display — 7.5" V2 (800×480) or larger
GxEPD2 — display driver library
PubSubClient — MQTT client library

Hardware

Component	Model	Notes
Microcontroller	ESP32-S3	PSRAM required for large font bitmaps
Display	Waveshare 7.5" V2 e-Paper	800×480, supports partial refresh
PC	Windows 10/11 with NVIDIA GPU	RTX series recommended
Microphone	USB condenser or direct mixer feed	Mixer feed preferred for clean diarization

Key Design Decisions

Text & Speaker Buffering

The bridge script accumulates text until a sentence boundary or natural pause (~4s), then checks whether the speaker has changed. If the speaker is unchanged, only new text lines are pushed. If the speaker has changed, a full new payload is sent including the speaker name header, triggering a full display refresh.
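
The buffering rule above can be sketched as a small class that collects text fragments, decides when to flush, and includes the speaker header only on a change. The class name `CaptionBuffer`, the payload keys, and the `full_refresh` flag are assumptions for illustration and not the actual bridge.py design.

```python
import json
from typing import List, Optional


class CaptionBuffer:
    """Accumulates text until a sentence end or pause (illustrative)."""

    def __init__(self) -> None:
        self._parts: List[str] = []
        self._last_speaker: Optional[str] = None

    def add(self, text: str) -> None:
        self._parts.append(text.strip())

    def ready(self, seconds_since_last_word: float, pause_s: float = 4.0) -> bool:
        """True at a sentence boundary or after a pause of about 4 seconds."""
        if not self._parts:
            return False
        ends_sentence = self._parts[-1].endswith((".", "?", "!"))
        return ends_sentence or seconds_since_last_word >= pause_s

    def flush(self, speaker: str) -> str:
        """Return the JSON payload and reset the buffer."""
        changed = speaker != self._last_speaker
        payload = {"text": " ".join(self._parts), "full_refresh": changed}
        if changed:
            payload["speaker"] = speaker  # header only on speaker change
        self._parts = []
        self._last_speaker = speaker
        return json.dumps(payload)
```

In this sketch the display side only needs to inspect `full_refresh` to choose between clearing the screen and appending text.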

Display Layout

Line 1: Speaker name in CAPS — printed only on speaker change
Lines 2–4: Rolling transcription text, wrapping at ~40 chars per line
On speaker change: full screen clear then redraw with new name header
Font targets readability at 3–5 metres
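
As a rough illustration of the layout rules above, the wrapping and rolling window can be prototyped on the PC with Python's standard `textwrap` module before committing to firmware. The function names are hypothetical, and leaving the header line empty when the speaker is unchanged is an assumption about how "printed only on speaker change" is realised.

```python
import textwrap
from typing import List


def layout_lines(text: str, width: int = 40, rows: int = 3) -> List[str]:
    """Wrap text at about 40 characters and keep the most recent rows."""
    lines = textwrap.wrap(text, width=width)
    return lines[-rows:]


def render_screen(speaker: str, text: str, speaker_changed: bool) -> List[str]:
    """Line 1 is the name in capitals on a change, otherwise empty."""
    header = speaker.upper() if speaker_changed else ""
    return [header] + layout_lines(text)
```

Keeping only the last three wrapped lines gives the rolling effect: older text scrolls off as new words arrive.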

E-ink Refresh Strategy

Speaker change → full refresh (~1.5s flash — clean slate, acceptable at speaker transition)
Same speaker, new text → partial refresh (~300ms, minor ghosting)
Force full refresh every 10 partial refreshes to clear accumulated ghosting
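
The refresh rules above amount to a small counter. The sketch below expresses that logic in Python for clarity; on the device it would be implemented in the ESP32 firmware, and the class and method names here are hypothetical.

```python
class RefreshPolicy:
    """Chooses between full and partial e-ink refreshes (illustrative)."""

    def __init__(self, max_partials: int = 10) -> None:
        self.max_partials = max_partials
        self._partials = 0  # partial refreshes since the last full one

    def next_refresh(self, speaker_changed: bool) -> str:
        """Return "full" or "partial" for the next screen update."""
        if speaker_changed or self._partials >= self.max_partials:
            self._partials = 0
            return "full"
        self._partials += 1
        return "partial"
```

With the default of 10, ten consecutive same-speaker updates are partial and the eleventh is forced to a full refresh; a speaker change resets the count early.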

Network

All traffic on local WiFi (church LAN or dedicated hotspot)
MQTT broker on Windows PC (port 1883)
Static IP recommended for ESP32 to avoid reconnection delays

Repository Structure

/
├── README.md                     — This file
├── CLAUDE.md                     — AI assistant context for development sessions
├── bridge/
│   ├── bridge.py                 — Main bridge: Whisper WS → name map → MQTT
│   ├── speaker_registry.py       — Speaker ID ↔ name mapping and voice enrolment
│   └── admin_ui.py               — Operator UI for live speaker naming (Tkinter)
├── esp32/
│   ├── src/
│   │   └── main.cpp              — ESP32 Arduino firmware
│   └── platformio.ini            — PlatformIO build config
└── docs/
    ├── hardware-wiring.md        — SPI pin connections for Waveshare display
    ├── setup.md                  — Installation and configuration guide
    └── speaker-enrolment.md     — Guide for recording and enrolling voice samples (v2)

Reference Projects

WhisperLiveKit — real-time Whisper + Streaming Sortformer diarization
NVIDIA Streaming Sortformer — the diarization model integrated into WhisperLiveKit
pyannote.audio — fallback diarization (Diart integration in WhisperLiveKit)
denwilliams/mqtt-epaper — ESP32 e-paper display driven by MQTT
cuci90/epaper_mqtt_esp32 — ESP32 Waveshare display MQTT template

Status

🟡 Planning / Research phase

Architecture defined
Speaker diarization approach selected (WhisperLiveKit + Streaming Sortformer)
Speaker naming strategy defined (operator-assisted v1, voice enrolment v2)
Python bridge script (transcription only)
Speaker name mapping layer (speaker_registry.py)
Operator admin UI (admin_ui.py)
ESP32 firmware — basic text display
ESP32 firmware — speaker header layout + refresh logic
Hardware wiring and bench test
End-to-end integration test
Voice enrolment system (v2)
Church deployment trial