The README shows voice enrolment v2 is the next planned feature, and per-speaker voice sample upload already exists in admin.py. That's the right foundation to build on.

---

### Voiceprint / Speaker Embedding Design

**To answer your core question first:** No, don't store the embedding in `speakers.json`. A speaker embedding is a float32 numpy array of 512 dimensions, about 2 KB of binary data. JSON can technically hold it as a list of floats, but it becomes unwieldy and slow to load. The clean approach is:

```
bridge/
├── speakers.json        ← names, metadata (text only, fast to load)
└── embeddings/
    ├── SPEAKER_00.npy   ← numpy binary, one file per speaker
    ├── SPEAKER_01.npy
    └── SPEAKER_02.npy
```

When you later move to a remote DB, the `.npy` files become `BLOB` columns or vector embeddings in something like pgvector. The `speakers.json` structure stays identical; you just add a `"has_embedding": true` flag.

---

### Extracting a Voiceprint From a 10-Second Snip

This is the right approach, far better than asking someone to "say a sentence". Here's the full picture:

**The tool:** `pyannote.audio`'s `SpeakerEmbedding` pipeline, already in your venv as a dependency of diart. It takes an audio segment and returns a 512-dimensional embedding vector that acts as a voice fingerprint.

**The process for extracting from an existing recording:**

1. Operator opens admin.py, navigates to a speaker
2. Selects a timestamp range from a test recording (e.g. "00:45 → 00:55")
3. Admin server extracts that 10s segment using miniaudio or ffmpeg
4. Passes it to `SpeakerEmbedding` pipeline
5. Saves the resulting `.npy` file to `embeddings/SPEAKER_XX.npy`

**At runtime during a service:** When diart assigns a `SPEAKER_XX` label, bridge.py checks if an embedding exists for any unmatched speaker ID. If the current speaker's embedding (accumulated over ~5 seconds of live speech) cosine-similarity matches a stored embedding above threshold (~0.82), it auto-assigns the name without operator input.
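The runtime matching step can be sketched roughly as follows. This is a minimal illustration, not the module itself: it assumes the enrolled embeddings are already loaded into a plain dict, and the `enrolled` parameter and the `cosine_similarity` helper are names introduced here for illustration. Only the ~0.82 threshold comes from the design above.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two 1-D vectors (0.0 if either is all zeros)."""
    a = np.asarray(a, dtype=np.float32).ravel()
    b = np.asarray(b, dtype=np.float32).ravel()
    denom = float(np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def find_best_match(live_embedding, enrolled, threshold=0.82):
    """Return (speaker_id, score) for the closest enrolled voiceprint,
    or None if nothing reaches the threshold.

    `enrolled` is a hypothetical {speaker_id: embedding} dict.
    """
    best_id, best_score = None, -1.0
    for speaker_id, stored in enrolled.items():
        score = cosine_similarity(live_embedding, stored)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, best_score
```

The threshold is a trade-off: lower values auto-assign names more often but risk mislabelling similar voices, so it is worth tuning against real recordings from the room.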
---

### The `speakers.json` Schema Change

Extend it from the current simple `{id: name}` to:

```json
{
  "SPEAKER_00": {
    "name": "Pastor John",
    "has_embedding": true,
    "embedding_updated": "2026-05-04T09:32:00",
    "colour": "#2563eb",
    "notes": "Senior pastor"
  },
  "SPEAKER_01": {
    "name": "Mary (Reader)",
    "has_embedding": false,
    "embedding_updated": null,
    "colour": "#16a34a",
    "notes": ""
  }
}
```

The bridge reads `has_embedding` to decide whether to attempt auto-matching. The actual vector lives in `embeddings/SPEAKER_XX.npy`.

---

### Implementation Plan

This is a 3-part addition:

**Part 1:** `embeddings.py` (new file, ~80 lines)

A small module that handles extraction, storage and matching:

```python
# Core interface (signatures only; bodies omitted)
import numpy as np

def extract_embedding(audio_path, start_sec, end_sec) -> np.ndarray: ...
def save_embedding(speaker_id, embedding) -> None: ...
def load_embedding(speaker_id) -> np.ndarray | None: ...
def find_best_match(live_embedding, threshold=0.82) -> tuple[str, float] | None: ...
```

**Part 2: Admin UI additions**

On each speaker row, add an "Extract from recording" button that lets the operator:

- Pick an uploaded test recording from a dropdown
- Set a start/end time (or use a simple "pick a 10s window" slider)
- Click "Extract"; the server processes it and saves the `.npy`
- Row updates to show "✓ Voiceprint saved"

**Part 3: Bridge auto-matching**

In `push_final()`, when a new unknown speaker appears, compare their accumulated audio embedding against all stored profiles before prompting the operator. If a match is found above threshold, auto-assign the name silently.

---

Want me to write `embeddings.py` first, then the admin UI changes, then the bridge matching logic? The `embeddings.py` module is the cleanest starting point: it's self-contained and we can test it standalone before wiring it into the UI.

## You said:

OK, let's get embeddings.py under way.

3:30 PM

## Claude responded:
Let me check the current bridge structure first before writing it.

The repo is requiring login for raw file access now. No matter: I have enough context from the session and the README. Let me write the module.

Here's what's in it and why each piece is designed the way it is.

**Storage:** `.npy` files in `bridge/embeddings/`, one per speaker. `speakers.json` gains `has_embedding`, `embedding_updated`, `colour`, and `notes` fields but the schema is backward compatible: existing flat `{"SPEAKER_00": "Pastor John"}` entries are migrated transparently via `_normalise_speaker_entry()`.

**Extraction:** `extract()` uses miniaudio (already in your venv) to pull a segment by start/end seconds, writes a temp WAV, feeds it to pyannote's `SpeakerEmbedding` pipeline, then deletes the temp file. The pipeline is lazy-loaded on first call so import is fast.

**`LiveEmbeddingAccumulator`:** this is the piece that plugs into bridge.py. When diart announces a new `SPEAKER_XX`, bridge creates an accumulator, feeds it PCM chunks via `push()`, and once `ready()` returns True (5 seconds by default) calls `extract_embedding()` + `registry.find_match()`. If it matches an enrolled speaker, the name is auto-assigned without operator prompt.

**CLI:** the file is executable standalone for testing without starting the full stack:

```bash
# Enrol from seconds 45–55 of last Sunday's recording
python bridge\embeddings.py enrol SPEAKER_00 test_recordings\service.mp3 --start 45 --end 55

# Check what it matched against all enrolled speakers
python bridge\embeddings.py scores test_recordings\service.mp3 --start 45 --end 55

# List who's enrolled
python bridge\embeddings.py list
```

Next steps are the admin UI additions (a segment picker per speaker row) and wiring `LiveEmbeddingAccumulator` into `bridge.py`'s `push_final()` path.
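For reference, the backward-compatible migration described under Storage could look roughly like this. The actual `_normalise_speaker_entry()` isn't shown in this conversation, so the sketch below is hypothetical: the function names, the default colour and the exact field handling are assumptions chosen to match the schema above.

```python
from typing import Any

# Hypothetical defaults for the extended schema; the fallback colour is an assumption.
DEFAULT_ENTRY: dict[str, Any] = {
    "name": "",
    "has_embedding": False,
    "embedding_updated": None,
    "colour": "#6b7280",
    "notes": "",
}


def normalise_speaker_entry(entry: Any) -> dict[str, Any]:
    """Upgrade a legacy flat entry ("Pastor John") to the extended dict form.

    Entries that are already dicts keep their values; missing fields get defaults.
    """
    normalised = dict(DEFAULT_ENTRY)
    if isinstance(entry, str):
        normalised["name"] = entry
    elif isinstance(entry, dict):
        normalised.update(entry)
    else:
        raise TypeError(f"Unsupported speaker entry type: {type(entry).__name__}")
    return normalised


def normalise_speakers(raw: dict[str, Any]) -> dict[str, dict[str, Any]]:
    """Normalise every entry of a loaded speakers.json mapping."""
    return {speaker_id: normalise_speaker_entry(e) for speaker_id, e in raw.items()}
```

Normalising on load means the rest of the code only ever sees the extended form, while old `speakers.json` files keep working unchanged.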
### What a Voiceprint Actually Is

The embedding captures **acoustic characteristics** of a voice: things like fundamental frequency, formant patterns, speaking rhythm, vocal tract shape. It has nothing to do with the words spoken. So a person saying "the quick brown fox" and saying "let us pray" would produce very similar embeddings, because it's the _voice_ not the _words_ being fingerprinted.

So transcription accuracy doesn't directly improve the embedding quality.

---

### What _Does_ Improve Embedding Quality

The factors that matter are:

| Factor | Impact |
| --- | --- |
| **Audio cleanliness** | Highest: noise, reverb, music bleed all degrade it |
| **Segment length** | 10–20s is the sweet spot; diminishing returns after 30s |
| **Single speaker** | Must be isolated: no overlapping voices in the segment |
| **Consistent mic distance** | Close-mic (pulpit mic) is better than room mic |
| **Multiple samples averaged** | Enrolling 2–3 separate clips and averaging the embeddings is more robust than one clip |

---

### Where Transcription _Does_ Help, Indirectly

Here's the indirect benefit you're probably intuiting:

**If you know what was said, you can verify the segment is clean.**

For example:

- If the transcript shows `[SPEAKER_00] "Let us open in prayer..."`, you know that's isolated speech, no congregation response, no overlap. That's a good enrolment segment.
- If the transcript shows `[SPEAKER_00] "[inaudible] ...mmm... [SPEAKER_01]"`, that's overlapping speakers: a bad segment, skip it.

So the workflow becomes:

```
1. Run a test recording through the pipeline
2. Read the transcript output to find clean, isolated segments per speaker
3. Note the timestamps of those clean segments
4. Use embeddings.py enrol with those specific timestamps
```

This is actually a **much better enrolment strategy** than asking someone to record a sample, because the transcript lets you cherry-pick the cleanest real-world speech from an actual service.
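The table above lists averaging 2–3 clips as more robust than a single clip. A minimal sketch of that averaging, assuming each clip has already been turned into a 1-D embedding vector: the `average_embeddings` name is introduced here for illustration, and normalising each clip to unit length before averaging is a common choice rather than something `embeddings.py` is confirmed to do.

```python
import numpy as np


def average_embeddings(embeddings):
    """Combine several enrolment embeddings into one unit-length voiceprint.

    Each clip is L2-normalised first so that a louder or longer clip
    does not dominate the mean.
    """
    if not embeddings:
        raise ValueError("Need at least one embedding to average")
    unit_vectors = []
    for emb in embeddings:
        vec = np.asarray(emb, dtype=np.float32).ravel()
        norm = np.linalg.norm(vec)
        if norm == 0.0:
            raise ValueError("Cannot normalise an all-zero embedding")
        unit_vectors.append(vec / norm)
    mean = np.mean(unit_vectors, axis=0)
    mean_norm = np.linalg.norm(mean)
    if mean_norm == 0.0:
        raise ValueError("Embeddings cancel out; check the enrolment clips")
    return (mean / mean_norm).astype(np.float32)
```

Because cosine similarity ignores vector length, the final re-normalisation does not change match scores; it just keeps stored voiceprints on a consistent scale.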
---

### Practical Addition to embeddings.py

We could add an `enrol_from_transcript()` helper that takes the transcript log (which bridge.py already produces) and automatically picks the best segment per speaker, i.e. the longest segment attributed to that speaker:

```python
from pathlib import Path

import numpy as np


# Intended as a method on the registry class: it calls self.extract_and_save()
def enrol_from_transcript(
    self,
    speaker_id: str,
    audio_path: Path,
    transcript_segments: list[dict],   # [{speaker, start, end, text}, ...]
    min_duration: float = 8.0,
) -> np.ndarray:
    """
    Find the longest single-speaker segment in a transcript and use it
    for enrolment. Segments shorter than min_duration are ignored.
    """
    candidates = [
        s for s in transcript_segments
        if s["speaker"] == speaker_id
        and (s["end"] - s["start"]) >= min_duration
    ]
    if not candidates:
        raise ValueError(f"No clean segment >= {min_duration}s found for {speaker_id}")

    # Pick longest segment
    best = max(candidates, key=lambda s: s["end"] - s["start"])
    return self.extract_and_save(speaker_id, audio_path, best["start"], best["end"])
```

This would mean after every test recording run, bridge.py logs timestamped segments, and the admin UI could show a "Use best segment for enrolment" button per speaker: fully automatic, no timestamp hunting needed.

Want me to add that to `embeddings.py` and also add the transcript segment logging to `bridge.py` so the timestamps are captured?

## You said:

OK, let's update them both. Let me know where I need to put the block into bridge.py and embeddings.py.

3:47 PM

## Claude responded:

Let me fetch the current bridge.py first so I place the changes exactly right.

Failed to fetch: https://git.tazz.com.au/benjamin.harris/Deaf_Transcription_Service/src/main/bridge/bridge.py

The repo is still requiring login for file access. Can you paste the current `bridge.py` content here?
Specifically I need to see:

1. The `push_final()` method in `BridgeState`
2. The `_receive_results()` async function
3. The `main()` function

Those are the three places I need to insert code. While you grab those, here's the complete updated `embeddings.py` with the transcript-based enrolment added.

Good. Now for the `bridge.py` changes: here are the **exact four changes** you need to make. Since I can't see the current file I'll give you find/replace targets that should be unambiguous:

---

### bridge.py, Change 1: Add import at the top

Find the block of imports at the top (near `import json`, `import re` etc). Add after the last import line:

```python
from embeddings import (
    EmbeddingRegistry,
    LiveEmbeddingAccumulator,
    log_transcript_segment,
)
```

---

### bridge.py, Change 2: Add registry + accumulator tracking to `BridgeState.__init__`

Find:

```python
self._last_final_time = time.monotonic()
```

Add immediately after it:

```python
# Voiceprint matching
self._embedding_registry = EmbeddingRegistry()
# Per-speaker accumulators: speaker_id -> LiveEmbeddingAccumulator
self._accumulators: dict[str, LiveEmbeddingAccumulator] = {}
# Speaker IDs confirmed this session (skip re-matching)
self._confirmed_ids: set[str] = set()
# Session ID for transcript logging (date string)
self._session_id: str = datetime.now(timezone.utc).strftime("%Y-%m-%d")
```

You'll also need to add `from datetime import datetime, timezone` to the imports if it's not already there.
---

### bridge.py, Change 3: Add segment logging + embedding accumulation to `push_final()`

Find the line inside `push_final()` that reads something like:

```python
self._last_final_time = time.monotonic()
```

Just before that line (the fallback estimate below reads `_last_final_time`, so it has to run before the value is reset), add:

```python
# ── Transcript segment logging ──────────────────────────────────────────
# Log every finalised segment for later voiceprint enrolment.
# Assumes the FrontData `response` object is reachable here; if push_final()
# only receives text and speaker_id, pass the timestamps in as arguments instead.
# Fallback is a rough, relative estimate (time since the previous final segment),
# not an absolute position in the recording.
seg_start = getattr(response, "start", None)
seg_end   = getattr(response, "end",   None)
if seg_start is None or seg_end is None:
    seg_end   = time.monotonic() - self._last_final_time
    seg_start = max(0.0, seg_end - len(text.split()) * 0.4)
log_transcript_segment(
    speaker_id = speaker_id or "UNKNOWN",
    text       = text,
    start_sec  = seg_start or 0.0,
    end_sec    = seg_end   or 0.0,
    session_id = self._session_id,
)

# ── Live embedding accumulation for auto speaker matching ───────────────
if speaker_id and speaker_id not in self._confirmed_ids:
    if speaker_id not in self._accumulators:
        self._accumulators[speaker_id] = \
            self._embedding_registry.make_accumulator(min_seconds=5.0)
    # We don't have raw PCM here; accumulation happens in _send_audio.
    # This is a placeholder; see Change 4 for the PCM-level accumulation.
```

---

### bridge.py, Change 4: PCM-level accumulation in `_send_audio`

This is the most important change: accumulating actual audio per speaker. Inside `audio_processor_loop`, find the `_send_audio` coroutine.
In it, find the line that calls:

```python
await _audio_processor.process_audio(chunk)
```

Replace it with:

```python
await _audio_processor.process_audio(chunk)

# Feed chunk to the accumulator for the current speaker.
# The raw speaker ID is recorded by push_final() as segments arrive.
current_spk = state.get_current_speaker_id()
if current_spk and current_spk not in state._confirmed_ids:
    if current_spk not in state._accumulators:
        state._accumulators[current_spk] = \
            state._embedding_registry.make_accumulator(min_seconds=5.0)
    state._accumulators[current_spk].push(chunk)

    if state._accumulators[current_spk].ready():
        try:
            live_emb = state._accumulators[current_spk].extract_embedding()
            match = state._embedding_registry.find_match(live_emb)
            if match:
                matched_id, score = match
                resolved_name = state.speaker_names.get(matched_id, matched_id)
                print(
                    f"[Embeddings] Auto-matched {current_spk} → "
                    f"{resolved_name} (score={score:.3f})"
                )
                state.set_speaker_name(current_spk, resolved_name)
                state._confirmed_ids.add(current_spk)
            else:
                # No match: reset accumulator to try again with more audio
                state._accumulators[current_spk].reset()
        except Exception as exc:
            # print() rather than a logger, matching the bridge's other output
            print(f"[Embeddings] Accumulator error: {exc}")
```

And add this method to `BridgeState`:

```python
def get_current_speaker_id(self) -> str | None:
    """Return the raw diarization ID (SPEAKER_XX) of the current speaker."""
    with self._lock:
        return self._raw_speaker_id   # set in push_final(), see below
```

And add `self._raw_speaker_id: str | None = None` to `__init__`, then in `push_final()` add:

```python
self._raw_speaker_id = speaker_id   # track raw ID for accumulator
```

at the top of the `with self._lock:` block.

---

That's all four changes. The key file to copy into `bridge/` is `embeddings.py`.
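For orientation, here is a rough idea of what `log_transcript_segment()` might do on the `embeddings.py` side. The real implementation isn't shown in this conversation, so treat this as a hypothetical sketch: the JSON Lines format, the `log_path` parameter, the `logged_at` field and the `load_segments()` helper are all assumptions. The record keys (`speaker`, `start`, `end`, `text`) are chosen to match what `enrol_from_transcript()` expects.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def log_transcript_segment(speaker_id, text, start_sec, end_sec, session_id,
                           log_path=Path("transcript_segments.jsonl")):
    """Append one finalised segment to a JSON Lines log (one JSON object per line)."""
    record = {
        "session": session_id,
        "speaker": speaker_id,
        "start": float(start_sec),
        "end": float(end_sec),
        "text": text,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_segments(log_path, session_id=None):
    """Read segments back, optionally keeping only one session."""
    segments = []
    with Path(log_path).open("r", encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if session_id is None or record["session"] == session_id:
                segments.append(record)
    return segments
```

An append-only JSON Lines file keeps each write cheap during a live service and can still be read line by line later when picking enrolment segments.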
Here are the five precise changes:

### Change 1: Imports (top of file)

Find:

```python
from collections import Counter
from pathlib import Path
```

Replace with:

```python
from collections import Counter
from datetime import datetime, timezone
from pathlib import Path
```

Then find:

```python
from whisperlivekit import AudioProcessor, TranscriptionEngine
```

Replace with:

```python
from whisperlivekit import AudioProcessor, TranscriptionEngine
from embeddings import (
    EmbeddingRegistry,
    log_transcript_segment,
)
```

### Change 2: `BridgeState.__init__`

Find:

```python
        self._last_final_time = time.monotonic()
```

Replace with:

```python
        self._last_final_time = time.monotonic()
        # Raw diarization ID of current speaker (SPEAKER_XX)
        self._raw_speaker_id: str | None = None
        # Voiceprint matching
        self._embedding_registry = EmbeddingRegistry()
        self._accumulators: dict[str, object] = {}
        self._confirmed_ids: set[str] = set()
        self._session_id: str = datetime.now(timezone.utc).strftime("%Y-%m-%d")
```

### Change 3: `push_final()` method

Find:

```python
    def push_final(self, text: str, speaker_id: str | None, mqtt_client: mqtt.Client) -> None:
        """Accept a finalised segment; flush on sentence boundary or speaker change."""
        with self._lock:
            if speaker_id:
                self._seen.add(speaker_id)
            resolved = self._resolve(speaker_id)
```

Replace with:

```python
    def push_final(self, text: str, speaker_id: str | None, mqtt_client: mqtt.Client,
                   seg_start: float = 0.0, seg_end: float = 0.0) -> None:
        """Accept a finalised segment; flush on sentence boundary or speaker change."""
        with self._lock:
            # Track raw diarization ID for PCM accumulator
            self._raw_speaker_id = speaker_id

            if speaker_id:
                self._seen.add(speaker_id)

            # Log segment to transcript_segments.jsonl for later enrolment
            log_transcript_segment(
                speaker_id = speaker_id or "UNKNOWN",
                text       = text,
                start_sec  = seg_start,
                end_sec    = seg_end,
                session_id = self._session_id,
            )

            resolved = self._resolve(speaker_id)
```

### Change 4: `_receive_results()`, pass timestamps to `push_final`

Find:

```python
    async def _receive_results():
        async for response in results_generator:
            # response is a FrontData dataclass, not a dict
            text = (getattr(response, "text", None)
                    or getattr(response, "buffer_transcription", None)
                    or "").strip()
            is_final = getattr(response, "is_final", False) or getattr(response, "end_of_segment", False)
            speaker = getattr(response, "speaker", None)
            if is_final and text:
                print(f"[Whisper] ({speaker or '?'}) {text}")
                state.push_final(text, speaker, mqtt_client)
```

Replace with:

```python
    async def _receive_results():
        async for response in results_generator:
            text = (getattr(response, "text", None)
                    or getattr(response, "buffer_transcription", None)
                    or "").strip()
            is_final = getattr(response, "is_final", False) or \
                       getattr(response, "end_of_segment", False)
            speaker = getattr(response, "speaker", None)
            # Grab timestamps if FrontData provides them
            seg_start = float(getattr(response, "start", 0.0) or 0.0)
            seg_end   = float(getattr(response, "end",   0.0) or 0.0)
            if is_final and text:
                print(f"[Whisper] ({speaker or '?'}) {text}")
                state.push_final(text, speaker, mqtt_client, seg_start, seg_end)
```

### Change 5: `_send_audio()`, PCM accumulation for live matching

Find:

```python
        while True:
            # Drain test audio injection first if available
            try:
                chunk = test_audio_queue.get_nowait()
            except asyncio.QueueEmpty:
                chunk = await audio_queue.get()
            await audio_processor.process_audio(chunk)
```

Replace with:

```python
        while True:
            # Drain test audio injection first if available
            try:
                chunk = test_audio_queue.get_nowait()
            except asyncio.QueueEmpty:
                chunk = await audio_queue.get()
            await audio_processor.process_audio(chunk)

            # Accumulate PCM for live speaker matching
            current_spk = state._raw_speaker_id
            if current_spk and current_spk not in state._confirmed_ids:
                if current_spk not in state._accumulators:
                    state._accumulators[current_spk] = \
                        state._embedding_registry.make_accumulator(min_seconds=5.0)
                state._accumulators[current_spk].push(chunk)

                if state._accumulators[current_spk].ready():
                    try:
                        live_emb = state._accumulators[current_spk].extract_embedding()
                        match = state._embedding_registry.find_match(live_emb)
                        if match:
                            matched_id, score = match
                            resolved = state.speaker_names.get(matched_id, matched_id)
                            print(
                                f"[Embeddings] Auto-matched {current_spk} → "
                                f"{resolved} (score={score:.3f})"
                            )
                            with state._lock:
                                state.speaker_names[current_spk] = resolved
                                state._confirmed_ids.add(current_spk)
                                _write_speakers(state.speaker_names)
                        else:
                            # No match yet: reset and try again with more audio
                            state._accumulators[current_spk].reset()
                    except Exception as exc:
                        print(f"[Embeddings] Accumulator error: {exc}")
                        state._accumulators[current_spk].reset()
```

That's all five changes.

One caveat: the dtype in `audio_callback` is currently float32, but the accumulator expects int16, and Change 5 feeds the chunk into the accumulator as-is. `LiveEmbeddingAccumulator.push()` and `extract_embedding()` write a WAV file assuming int16, so the chunk needs converting to int16 before it is pushed. Add these lines right before `state._accumulators[current_spk].push(chunk)` (this assumes numpy is already imported as `np` in bridge.py):

```python
# Convert float32 → int16 for the embedding accumulator
# (clip first so samples outside [-1.0, 1.0] don't wrap around)
samples = np.clip(np.frombuffer(chunk, dtype=np.float32), -1.0, 1.0)
chunk_i16 = (samples * 32767).astype(np.int16).tobytes()
```

Then change the `.push(chunk)` call to `.push(chunk_i16)`.

Once those are in, drop embeddings.py into the bridge/ folder alongside bridge.py and it will be importable. The transcript log will start filling up from the first run, ready for auto-enrol once you have a recording to pull segments from.

What needs to be added:

- A **Voiceprint** column to the speaker table with status badge
- A **voiceprint modal** per speaker with three tabs: auto-enrol from transcript, manual segment picker, and similarity test
- New API endpoints for all embedding operations
- A **Voiceprints** summary card showing all enrolled speakers