Benjamin Harris 49e65a4568 Add Setup Instructions 1 month ago
bridge ee90055248 Inital Hardare Commit 1 month ago
docs ee90055248 Inital Hardare Commit 1 month ago
esp32 ee90055248 Inital Hardare Commit 1 month ago
CLAUDE.md ee90055248 Inital Hardare Commit 1 month ago
README.md ee90055248 Inital Hardare Commit 1 month ago
SETUP.md 49e65a4568 Add Setup Instructions 1 month ago
install.bat 49e65a4568 Add Setup Instructions 1 month ago
start.bat 49e65a4568 Add Setup Instructions 1 month ago

README.md

Church Live Transcription Display

A live speech-to-text system for deaf and hard-of-hearing congregants, displaying real-time transcriptions with speaker identification on an e-ink screen driven by an ESP32 microcontroller.

Overview

Audio from the church service is captured on a Windows PC, transcribed and speaker-diarized locally using WhisperLiveKit, and the resulting speaker-tagged text is pushed over WiFi/MQTT to an ESP32 that drives a large e-ink display. The display shows who is speaking alongside what they are saying, updates in real time, and requires no internet connection.

[Microphone / Mixer]
        ↓
[Windows PC]
  ├── WhisperLiveKit  (transcription + speaker diarization)
  ├── Mosquitto MQTT broker
  ├── bridge.py       (WebSocket → name mapping → MQTT)
  └── Speaker Admin UI (operator names speakers live)
        ↓ MQTT / WiFi
[ESP32 + e-ink display]

Goals

  • Real-time captions with minimal latency (target: < 3 seconds end-to-end)
  • Speaker identification — display who is speaking when the speaker changes
  • Named speakers — operator maps anonymous speaker IDs to real names during service
  • Future: voice enrolment so names are matched automatically from pre-recorded samples
  • Runs entirely on local network — no cloud dependency
  • Readable at distance with large font (36–48pt equivalent)
  • Low cost, low complexity hardware

Speaker Identification

How It Works

WhisperLiveKit includes Streaming Sortformer, a real-time speaker diarization model developed by NVIDIA (state of the art as of 2025). It runs alongside Whisper transcription and tags each speech segment with an anonymous speaker label (SPEAKER_0, SPEAKER_1, etc.).

A name mapping layer in the bridge script translates these anonymous labels into real names, which are then included in the MQTT payload sent to the display.
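
The mapping step could look roughly like the sketch below. It is illustrative only: the `SPEAKER_NAMES` dictionary, the `build_payload` helper, and the JSON field names `speaker` and `text` are assumptions, since this README does not specify the payload format used by bridge.py.

```python
import json

# Hypothetical session mapping from diarization labels to display names.
SPEAKER_NAMES = {"SPEAKER_0": "Pastor John", "SPEAKER_1": "Mary (Reader)"}


def build_payload(label: str, text: str, names: dict[str, str]) -> str:
    """Return a JSON string carrying the resolved speaker name and the text."""
    # Fall back to the raw label while the speaker has not been named yet.
    speaker = names.get(label, label)
    return json.dumps({"speaker": speaker, "text": text}, ensure_ascii=False)


payload = build_payload("SPEAKER_0", "Go into all the world", SPEAKER_NAMES)
```

The resulting string is what would be handed to the MQTT client for publishing; an unnamed label such as `SPEAKER_7` passes through unchanged.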

Display Format

When a speaker changes, their name is shown as a header line above their words. The name is not repeated on every line — only when the speaker changes.

┌─────────────────────────────────┐
│ PASTOR JOHN                     │
│ ...and He said unto them, go    │
│ into all the world and preach   │
└─────────────────────────────────┘

┌─────────────────────────────────┐
│ MARY (READER)                   │
│ A reading from Luke chapter 4,  │
│ verse 18...                     │
└─────────────────────────────────┘
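
The header-on-change rule can be modelled in a few lines. This is a sketch under assumptions: `render_lines` and the `(speaker, text)` tuple shape are made up for illustration and are not taken from the project code.

```python
def render_lines(segments):
    """Yield display lines, emitting a name header only when the speaker changes."""
    current = None
    for speaker, text in segments:
        if speaker != current:
            yield speaker.upper()  # header line in capitals
            current = speaker
        yield text


segments = [
    ("Pastor John", "...and He said unto them, go"),
    ("Pastor John", "into all the world and preach"),
    ("Mary (Reader)", "A reading from Luke chapter 4,"),
]
lines = list(render_lines(segments))
```

Here `lines` holds one header for Pastor John followed by his two lines, then a new header when Mary starts reading.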

Speaker Naming — Two Approaches

v1 — Operator-Assisted Naming (implemented first)

A simple admin UI runs on the PC alongside the bridge script. When a new unknown speaker is detected, the operator sees a prompt ("New speaker detected — who is this?") and types the name once. That name is stored for the session and used every time that speaker is detected again.

  • No setup required before the service
  • Works from the very first Sunday
  • Operator (e.g. sound desk volunteer) assigns names as speakers appear
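
A minimal sketch of the session store behind this flow is shown below. The class name `SessionRegistry` and its methods are hypothetical; the real speaker_registry.py may be organised differently. The `pending` list stands in for the prompt the admin UI would show.

```python
class SessionRegistry:
    """Session-scoped label-to-name store with a queue of unnamed speakers."""

    def __init__(self):
        self.names = {}
        self.pending = []  # labels still waiting for a name from the operator

    def resolve(self, label):
        """Return the assigned name, or the raw label while it is unnamed."""
        if label in self.names:
            return self.names[label]
        if label not in self.pending:
            self.pending.append(label)  # the UI would prompt once per label
        return label

    def assign(self, label, name):
        """Record the operator's answer so later segments use the name."""
        self.names[label] = name
        if label in self.pending:
            self.pending.remove(label)


registry = SessionRegistry()
before = registry.resolve("SPEAKER_0")   # unnamed, queued for the operator
registry.assign("SPEAKER_0", "Pastor John")
after = registry.resolve("SPEAKER_0")    # now resolves to the typed name
```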

v2 — Voice Enrolment (future upgrade)

Before the service, a short voice sample (10–30 seconds) is recorded for each expected speaker. The bridge script compares incoming speaker embeddings against enrolled voices and automatically assigns the correct name without operator input.

  • No operator intervention during the service
  • More accurate for recurring speakers (pastor, regular readers)
  • Enrolled voice profiles persist week to week
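
One common way to compare an incoming embedding against enrolled voices is cosine similarity with a rejection threshold. The sketch below assumes that approach; the README does not say which similarity measure will be used, and the 0.75 threshold, the function names, and the three-element toy vectors are all placeholders. It also assumes speaker embeddings can be obtained from the diarization stage, which is not confirmed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_speaker(embedding, enrolled, threshold=0.75):
    """Return the best-matching enrolled name, or None if nothing reaches the threshold."""
    best_name, best_score = None, threshold
    for name, reference in enrolled.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Toy enrolled profiles; real embeddings would have many more dimensions.
enrolled = {"Pastor John": [0.9, 0.1, 0.0], "Mary (Reader)": [0.1, 0.9, 0.2]}
```

Returning `None` for a weak match leaves room to fall back to operator naming, for example for a visiting speaker.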

Typical Church Speakers

| Role | Frequency | Notes |
|---|---|---|
| Pastor / Preacher | Every service | Primary speaker, longest segments |
| Worship leader | Most services | May overlap with congregation response |
| Reader / Scripture | Weekly | Short, distinct segments |
| Visiting speaker | Occasionally | New enrolment or operator naming needed |
| Announcements | Weekly | Often the same person each week |

System Components

PC Side (Windows)

  • WhisperLiveKit — local GPU-accelerated transcription + diarization server
  • Mosquitto — lightweight MQTT broker (same PC, port 1883)
  • bridge.py — WebSocket subscriber, speaker name mapper, MQTT publisher
  • admin_ui.py — lightweight operator interface for live speaker naming
  • speaker_registry.py — manages speaker ID ↔ name mappings and voice enrolment

ESP32 Side

  • ESP32-S3 — WiFi microcontroller (S3 preferred — PSRAM needed for large font bitmaps)
  • Waveshare e-ink display — 7.5" V2 (800×480) or larger
  • GxEPD2 — display driver library
  • PubSubClient — MQTT client library

Hardware

| Component | Model | Notes |
|---|---|---|
| Microcontroller | ESP32-S3 | PSRAM required for large font bitmaps |
| Display | Waveshare 7.5" V2 e-Paper | 800×480, supports partial refresh |
| PC | Windows 10/11 with NVIDIA GPU | RTX series recommended |
| Microphone | USB condenser or direct mixer feed | Mixer feed preferred for clean diarization |

Key Design Decisions

Text & Speaker Buffering

The bridge script accumulates text until a sentence boundary or natural pause (~4s), then checks whether the speaker has changed. If the speaker is unchanged, only new text lines are pushed. If the speaker has changed, a full new payload is sent including the speaker name header, triggering a full display refresh.
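
The buffering rule might be sketched as follows. `TextBuffer`, its method names, and the choice of `.`, `?` and `!` as sentence boundaries are assumptions for illustration; only the roughly 4 second pause comes from the description above. The clock is injectable so the behaviour can be tested without waiting.

```python
import time

SENTENCE_END = (".", "?", "!")  # assumed sentence-boundary markers
PAUSE_SECONDS = 4.0             # the ~4s natural pause


class TextBuffer:
    """Accumulate text fragments until a sentence ends or the speaker pauses."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.parts = []
        self.last_update = clock()
        self.last_speaker = None

    def add(self, text):
        self.parts.append(text.strip())
        self.last_update = self.clock()

    def ready(self):
        """True when buffered text should be pushed to the display."""
        if not self.parts:
            return False
        at_boundary = self.parts[-1].endswith(SENTENCE_END)
        paused = self.clock() - self.last_update >= PAUSE_SECONDS
        return at_boundary or paused

    def flush(self, speaker):
        """Return (speaker_changed, text) and reset the buffer."""
        text = " ".join(self.parts)
        changed = speaker != self.last_speaker
        self.parts = []
        self.last_speaker = speaker
        return changed, text
```

The `speaker_changed` flag returned by `flush` is what would decide between sending only the new text and sending a full payload with a name header.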

Display Layout

  • Line 1: Speaker name in CAPS — printed only on speaker change
  • Lines 2–4: Rolling transcription text, wrapping at ~40 chars per line
  • On speaker change: full screen clear then redraw with new name header
  • Font targets readability at 3–5 metres
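
As a rough model of this layout, the text can be wrapped with Python's standard `textwrap` module and trimmed to the three most recent lines. The `layout` helper and its constants are hypothetical, and omitting the header row entirely when the speaker is unchanged is an assumption; the firmware may instead leave that row blank or keep the previous name.

```python
import textwrap

LINE_WIDTH = 40  # approximate characters per line
TEXT_LINES = 3   # display lines 2 to 4


def layout(speaker, text, show_header):
    """Return the lines to draw: optional caps header plus the latest text lines."""
    wrapped = textwrap.wrap(text, width=LINE_WIDTH)
    body = wrapped[-TEXT_LINES:]  # keep only the most recent lines
    header = [speaker.upper()] if show_header else []
    return header + body


screen = layout(
    "Pastor John",
    "and He said unto them, go into all the world "
    "and preach the gospel to every creature",
    show_header=True,
)
```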

E-ink Refresh Strategy

  • Speaker change → full refresh (~1.5s flash — clean slate, acceptable at speaker transition)
  • Same speaker, new text → partial refresh (~300ms, minor ghosting)
  • Force full refresh every 10 partial refreshes to clear accumulated ghosting
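
The refresh decision reduces to a small counter. The sketch below models that logic in Python for clarity only; the actual firmware is C++ (esp32/src/main.cpp) and would drive the display through GxEPD2, whose calls are not shown here. `RefreshPolicy` and `Refresh` are made-up names.

```python
from enum import Enum


class Refresh(Enum):
    FULL = "full"
    PARTIAL = "partial"


class RefreshPolicy:
    """Choose full or partial refresh, forcing a full one after N partials."""

    def __init__(self, max_partials=10):
        self.max_partials = max_partials
        self.partials = 0  # partial refreshes since the last full refresh

    def next_refresh(self, speaker_changed):
        if speaker_changed or self.partials >= self.max_partials:
            self.partials = 0
            return Refresh.FULL
        self.partials += 1
        return Refresh.PARTIAL


policy = RefreshPolicy()
first = policy.next_refresh(speaker_changed=True)
```

With the default of 10, a long sermon from one speaker would trigger a ghost-clearing full refresh on every eleventh update.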

Network

  • All traffic on local WiFi (church LAN or dedicated hotspot)
  • MQTT broker on Windows PC (port 1883)
  • Static IP recommended for ESP32 to avoid reconnection delays

Repository Structure

/
├── README.md                     — This file
├── CLAUDE.md                     — AI assistant context for development sessions
├── bridge/
│   ├── bridge.py                 — Main bridge: Whisper WS → name map → MQTT
│   ├── speaker_registry.py       — Speaker ID ↔ name mapping and voice enrolment
│   └── admin_ui.py               — Operator UI for live speaker naming (Tkinter)
├── esp32/
│   ├── src/
│   │   └── main.cpp              — ESP32 Arduino firmware
│   └── platformio.ini            — PlatformIO build config
└── docs/
    ├── hardware-wiring.md        — SPI pin connections for Waveshare display
    ├── setup.md                  — Installation and configuration guide
    └── speaker-enrolment.md     — Guide for recording and enrolling voice samples (v2)

Reference Projects


Status

🟡 Planning / Research phase

  • Architecture defined
  • Speaker diarization approach selected (WhisperLiveKit + Streaming Sortformer)
  • Speaker naming strategy defined (operator-assisted v1, voice enrolment v2)
  • Python bridge script (transcription only)
  • Speaker name mapping layer (speaker_registry.py)
  • Operator admin UI (admin_ui.py)
  • ESP32 firmware — basic text display
  • ESP32 firmware — speaker header layout + refresh logic
  • Hardware wiring and bench test
  • End-to-end integration test
  • Voice enrolment system (v2)
  • Church deployment trial