|
|
@@ -0,0 +1,626 @@
|
|
|
+The README shows voice enrolment v2 is the next planned feature, and per-speaker voice sample upload already exists in admin.py. That's the right foundation to build on.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### Voiceprint / Speaker Embedding Design
|
|
|
+
|
|
|
+**To answer your core question first:** No, don't store the embedding in `speakers.json`. A speaker embedding is a float32 numpy array of 512 dimensions — about 2KB of binary data. JSON can technically hold it as a list of floats but it becomes unwieldy and slow to load. The clean approach is:
|
|
|
+
|
|
|
+```
|
|
|
+bridge/
|
|
|
+├── speakers.json ← names, metadata (text only, fast to load)
|
|
|
+└── embeddings/
|
|
|
+    ├── SPEAKER_00.npy    ← numpy binary, one file per speaker
|
|
|
+    ├── SPEAKER_01.npy
|
|
|
+    └── SPEAKER_02.npy
|
|
|
+```
|
|
|
+
|
|
|
+When you later move to a remote DB, the `.npy` files become `BLOB` columns or vector columns in something like pgvector. The `speakers.json` structure stays the same apart from an added `"has_embedding": true` flag.
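A minimal sketch of that storage layout, assuming one `.npy` file per speaker under `bridge/embeddings/`. The function names follow the interface proposed further down in this conversation; the `root` parameter and the default path are illustrative assumptions, not the actual module.

```python
from pathlib import Path
from typing import Optional

import numpy as np

# Assumed default location, matching the directory layout sketched above.
EMBEDDINGS_DIR = Path("bridge/embeddings")


def save_embedding(speaker_id: str, embedding: np.ndarray,
                   root: Path = EMBEDDINGS_DIR) -> Path:
    """Write one float32 vector to <root>/<speaker_id>.npy and return the path."""
    root.mkdir(parents=True, exist_ok=True)
    path = root / f"{speaker_id}.npy"
    np.save(path, np.asarray(embedding, dtype=np.float32))
    return path


def load_embedding(speaker_id: str,
                   root: Path = EMBEDDINGS_DIR) -> Optional[np.ndarray]:
    """Return the stored vector, or None if this speaker is not enrolled."""
    path = root / f"{speaker_id}.npy"
    if not path.exists():
        return None
    return np.load(path)
```

Keeping the vectors out of `speakers.json` means the JSON stays small and text-only, while each embedding can be replaced without rewriting the metadata file.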
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### Extracting a Voiceprint From a 10-Second Snip
|
|
|
+
|
|
|
+This is the right approach — far better than asking someone to "say a sentence". Here's the full picture:
|
|
|
+
|
|
|
+**The tool:** `pyannote.audio`'s `SpeakerEmbedding` pipeline, already in your venv as a dependency of diart. It takes an audio segment and returns a fixed-length embedding vector that acts as a voice fingerprint (512 dimensions is assumed here; the exact size depends on the embedding model).
|
|
|
+
|
|
|
+**The process for extracting from an existing recording:**
|
|
|
+
|
|
|
+1. Operator opens admin.py, navigates to a speaker
|
|
|
+2. Selects a timestamp range from a test recording (e.g. "00:45 → 00:55")
|
|
|
+3. Admin server extracts that 10s segment using miniaudio or ffmpeg
|
|
|
+4. Passes it to `SpeakerEmbedding` pipeline
|
|
|
+5. Saves the resulting `.npy` file to `embeddings/SPEAKER_XX.npy`
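The segment-extraction step (step 3) comes down to converting a timestamp range into sample indices. The sketch below assumes the recording has already been decoded to a mono sample array at a known rate; decoding with miniaudio or ffmpeg is left out, and `slice_segment` is a hypothetical helper name.

```python
import numpy as np


def slice_segment(samples: np.ndarray, sample_rate: int,
                  start_sec: float, end_sec: float) -> np.ndarray:
    """Return the samples between start_sec and end_sec, clamped to the audio."""
    if end_sec <= start_sec:
        raise ValueError("end_sec must be greater than start_sec")
    start = max(0, int(round(start_sec * sample_rate)))
    end = min(len(samples), int(round(end_sec * sample_rate)))
    return samples[start:end]


# One minute of dummy audio at 16 kHz; pull out 00:45 to 00:55.
audio = np.arange(16000 * 60)
segment = slice_segment(audio, 16000, 45.0, 55.0)
```

Clamping to the array length means a range that runs past the end of the recording yields a shorter segment rather than an error, so the caller may still want to check the resulting duration.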
|
|
|
+
|
|
|
+**At runtime during a service:**
|
|
|
+
|
|
|
+When diart assigns a new `SPEAKER_XX` label, bridge.py accumulates ~5 seconds of that speaker's live speech, computes an embedding from it, and compares it by cosine similarity against every stored embedding. If the best score is above the threshold (~0.82), it auto-assigns the matching name without operator input.
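The runtime comparison can be sketched as a plain cosine-similarity search. This is an illustration under assumptions: the enrolled embeddings are passed in as a dict (the real module may load them from disk instead), and 0.82 is the approximate threshold mentioned above, which would need tuning on real recordings.

```python
from __future__ import annotations

from typing import Optional

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


def find_best_match(live_embedding: np.ndarray,
                    enrolled: dict[str, np.ndarray],
                    threshold: float = 0.82) -> Optional[tuple[str, float]]:
    """Return (speaker_id, score) for the closest enrolled voice, or None."""
    best_id, best_score = None, -1.0
    for speaker_id, stored in enrolled.items():
        score = cosine_similarity(live_embedding, stored)
        if score > best_score:
            best_id, best_score = speaker_id, score
    if best_id is None or best_score < threshold:
        return None
    return best_id, best_score
```

Returning `None` below the threshold leaves the decision with the operator, which is the safer failure mode for a live caption feed.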
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### The `speakers.json` Schema Change
|
|
|
+
|
|
|
+Extend it from the current simple `{id: name}` to:
|
|
|
+
|
|
|
+```json
|
|
|
+{
|
|
|
+ "SPEAKER_00": {
|
|
|
+ "name": "Pastor John",
|
|
|
+ "has_embedding": true,
|
|
|
+ "embedding_updated": "2026-05-04T09:32:00",
|
|
|
+ "colour": "#2563eb",
|
|
|
+ "notes": "Senior pastor"
|
|
|
+ },
|
|
|
+ "SPEAKER_01": {
|
|
|
+ "name": "Mary (Reader)",
|
|
|
+ "has_embedding": false,
|
|
|
+ "embedding_updated": null,
|
|
|
+ "colour": "#16a34a",
|
|
|
+ "notes": ""
|
|
|
+ }
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+The bridge reads `has_embedding` to decide whether to attempt auto-matching. The actual vector lives in `embeddings/SPEAKER_XX.npy`.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### Implementation Plan
|
|
|
+
|
|
|
+This is a 3-part addition:
|
|
|
+
|
|
|
+**Part 1 —** `embeddings.py` (new file, ~80 lines) A small module that handles extraction, storage and matching:
|
|
|
+
|
|
|
+```python
|
|
|
+# Core interface (signatures only; bodies omitted)
|
|
|
+import numpy as np
|
|
|
+
|
|
|
+def extract_embedding(audio_path, start_sec, end_sec) -> np.ndarray: ...
|
|
|
+def save_embedding(speaker_id, embedding) -> None: ...
|
|
|
+def load_embedding(speaker_id) -> np.ndarray | None: ...
|
|
|
+def find_best_match(live_embedding, threshold=0.82) -> tuple[str, float] | None: ...
|
|
|
+```
|
|
|
+
|
|
|
+**Part 2: Admin UI additions** On each speaker row, add an "Extract from recording" button that lets the operator:
|
|
|
+
|
|
|
+- Pick an uploaded test recording from a dropdown
|
|
|
+- Set a start/end time (or use a simple "pick a 10s window" slider)
|
|
|
+- Click "Extract" — server processes it and saves the `.npy`
|
|
|
+- Row updates to show "✓ Voiceprint saved"
|
|
|
+
|
|
|
+**Part 3 — Bridge auto-matching** In `push_final()`, when a new unknown speaker appears, compare their accumulated audio embedding against all stored profiles before prompting the operator. If a match is found above threshold, auto-assign the name silently.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+Want me to write `embeddings.py` first, then the admin UI changes, then the bridge matching logic? The `embeddings.py` module is the cleanest starting point — it's self-contained and we can test it standalone before wiring it into the UI.
|
|
|
+
|
|
|
+## You said: ok, let's get the embeddings.
|
|
|
+
|
|
|
+ok, let's get the embeddings.py under way
|
|
|
+
|
|
|
+3:30 PM
|
|
|
+
|
|
|
+## Claude responded: Let me check the current bridge structure first before writing it.
|
|
|
+
|
|
|
+Let me check the current bridge structure first before writing it.
|
|
|
+
|
|
|
+Searched the web
|
|
|
+
|
|
|
+The repo is requiring login for raw file access now. No matter — I have enough context from the session and the README. Let me write the module.
|
|
|
+
|
|
|
+Viewed a file, created a file, read a file
|
|
|
+
|
|
|
+Here's what's in it and why each piece is designed the way it is.
|
|
|
+
|
|
|
+**Storage** — `.npy` files in `bridge/embeddings/`, one per speaker. `speakers.json` gains `has_embedding`, `embedding_updated`, `colour`, and `notes` fields but the schema is backward compatible — existing flat `{"SPEAKER_00": "Pastor John"}` entries are migrated transparently via `_normalise_speaker_entry()`.
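A sketch of how that transparent migration could work. The source names `_normalise_speaker_entry()` but does not show it, so the function below is a hypothetical stand-in; the default values for the new fields are assumptions.

```python
from typing import Any

# Assumed defaults for the fields added by the extended schema.
DEFAULTS = {
    "has_embedding": False,
    "embedding_updated": None,
    "colour": "",
    "notes": "",
}


def normalise_speaker_entry(entry: Any) -> dict:
    """Accept either a legacy name string or a new-style dict."""
    if isinstance(entry, str):
        # Legacy flat form: {"SPEAKER_00": "Pastor John"}
        return {"name": entry, **DEFAULTS}
    if isinstance(entry, dict):
        # New form: fill in any missing fields, keep existing values.
        return {"name": "", **DEFAULTS, **entry}
    raise TypeError(f"Unsupported speaker entry: {entry!r}")


def normalise_speakers(raw: dict) -> dict:
    """Normalise a whole speakers.json mapping."""
    return {sid: normalise_speaker_entry(e) for sid, e in raw.items()}
```

Because the normalised form is a superset of the legacy form, old files keep loading and are upgraded the next time the mapping is written back.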
|
|
|
+
|
|
|
+**Extraction** — `extract()` uses miniaudio (already in your venv) to pull a segment by start/end seconds, writes a temp WAV, feeds it to pyannote's `SpeakerEmbedding` pipeline, then deletes the temp file. The pipeline is lazy-loaded on first call so import is fast.
|
|
|
+
|
|
|
+`LiveEmbeddingAccumulator` — this is the piece that plugs into bridge.py. When diart announces a new `SPEAKER_XX`, bridge creates an accumulator, feeds it PCM chunks via `push()`, and once `ready()` returns True (5 seconds by default) calls `extract_embedding()` + `registry.find_match()`. If it matches an enrolled speaker, the name is auto-assigned without operator prompt.
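The buffering half of that accumulator can be sketched without any audio libraries. `PcmAccumulator` is a hypothetical stand-in for `LiveEmbeddingAccumulator`, covering only `push()`, `ready()` and `reset()`; the 16 kHz, 16-bit mono format is an assumption, and the embedding extraction itself is left out.

```python
class PcmAccumulator:
    """Collects raw PCM bytes until enough speech is buffered for an embedding."""

    def __init__(self, min_seconds: float = 5.0,
                 sample_rate: int = 16000, bytes_per_sample: int = 2):
        self._bytes_per_second = sample_rate * bytes_per_sample
        self._needed = int(min_seconds * self._bytes_per_second)
        self._buf = bytearray()

    def push(self, pcm: bytes) -> None:
        """Append one chunk of PCM audio."""
        self._buf.extend(pcm)

    def seconds(self) -> float:
        """Duration of audio buffered so far."""
        return len(self._buf) / self._bytes_per_second

    def ready(self) -> bool:
        """True once at least min_seconds of audio has been pushed."""
        return len(self._buf) >= self._needed

    def reset(self) -> None:
        """Discard the buffer, for example after a failed match."""
        self._buf.clear()


acc = PcmAccumulator(min_seconds=1.0)
acc.push(b"\x00\x00" * 8000)   # half a second of silence
half_ready = acc.ready()
acc.push(b"\x00\x00" * 8000)   # another half second
```

Counting bytes rather than wall-clock time keeps the threshold tied to how much audio was actually received, regardless of chunk size.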
|
|
|
+
|
|
|
+**CLI** — the file is executable standalone for testing without starting the full stack:
|
|
|
+
|
|
|
+```bash
|
|
|
+# Enrol from seconds 45–55 of last Sunday's recording
|
|
|
+python bridge/embeddings.py enrol SPEAKER_00 test_recordings/service.mp3 --start 45 --end 55
|
|
|
+
|
|
|
+# Show similarity scores for that segment against all enrolled speakers
|
|
|
+python bridge/embeddings.py scores test_recordings/service.mp3 --start 45 --end 55
|
|
|
+
|
|
|
+# List who's enrolled
|
|
|
+python bridge/embeddings.py list
|
|
|
+```
|
|
|
+
|
|
|
+Next steps are the admin UI additions (a segment picker per speaker row) and wiring `LiveEmbeddingAccumulator` into `bridge.py`'s `push_final()` path.
|
|
|
+
|
|
|
+
|
|
|
+### What a Voiceprint Actually Is
|
|
|
+
|
|
|
+The embedding captures **acoustic characteristics** of a voice — things like fundamental frequency, formant patterns, speaking rhythm, vocal tract shape. It has nothing to do with the words spoken. So a person saying "the quick brown fox" and saying "let us pray" would produce very similar embeddings, because it's the _voice_ not the _words_ being fingerprinted.
|
|
|
+
|
|
|
+So transcription accuracy doesn't directly improve the embedding quality.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### What _Does_ Improve Embedding Quality
|
|
|
+
|
|
|
+The factors that matter are:
|
|
|
+
|
|
|
+| Factor | Impact |
|
|
|
+| --- | --- |
|
|
|
+| **Audio cleanliness** | Highest — noise, reverb, music bleed all degrade it |
|
|
|
+| **Segment length** | 10–20s is the sweet spot; diminishing returns after 30s |
|
|
|
+| **Single speaker** | Must be isolated — no overlapping voices in the segment |
|
|
|
+| **Consistent mic distance** | Close-mic (pulpit mic) is better than room mic |
|
|
|
+| **Multiple samples averaged** | Enrolling 2–3 separate clips and averaging the embeddings is more robust than one clip |
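The "multiple samples averaged" row can be illustrated with a short sketch. It normalises each clip's embedding to unit length before averaging so that no single clip dominates, then renormalises the mean; this is one common way to build a centroid and is an assumption here, not necessarily what `embeddings.py` does.

```python
import numpy as np


def average_embeddings(embeddings) -> np.ndarray:
    """Combine several enrolment embeddings into one unit-length centroid."""
    stacked = np.stack([np.asarray(e, dtype=np.float32) for e in embeddings])
    norms = np.linalg.norm(stacked, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # leave all-zero vectors untouched
    mean = (stacked / norms).mean(axis=0)
    length = np.linalg.norm(mean)
    return mean / length if length > 0 else mean


# Two toy "clips" pointing in different directions.
centroid = average_embeddings([np.array([2.0, 0.0]), np.array([0.0, 5.0])])
```

With cosine similarity as the matching metric, only the direction of the centroid matters, which is why the final renormalisation is harmless.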
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### Where Transcription _Does_ Help — Indirectly
|
|
|
+
|
|
|
+Here's the indirect benefit you're probably intuiting:
|
|
|
+
|
|
|
+**If you know what was said, you can verify the segment is clean.** For example:
|
|
|
+
|
|
|
+- If the transcript shows `[SPEAKER_00] "Let us open in prayer..."` — you know that's isolated speech, no congregation response, no overlap. That's a good enrolment segment.
|
|
|
+- If the transcript shows `[SPEAKER_00] "[inaudible] ...mmm... [SPEAKER_01]"` — overlapping speakers, bad segment, skip it.
|
|
|
+
|
|
|
+So the workflow becomes:
|
|
|
+
|
|
|
+```
|
|
|
+1. Run a test recording through the pipeline
|
|
|
+2. Read the transcript output to find clean, isolated segments per speaker
|
|
|
+3. Note the timestamps of those clean segments
|
|
|
+4. Use embeddings.py enrol with those specific timestamps
|
|
|
+```
|
|
|
+
|
|
|
+This is actually a **much better enrolment strategy** than asking someone to record a sample — because the transcript lets you cherry-pick the cleanest real-world speech from an actual service.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### Practical Addition to embeddings.py
|
|
|
+
|
|
|
+We could add an `enrol_from_transcript()` helper that takes the transcript log (which bridge.py already produces) and automatically picks the best segment per speaker: the longest clean run with no speaker change within it.
|
|
|
+
|
|
|
+```python
|
|
|
+from pathlib import Path
|
|
|
+
|
|
|
+import numpy as np
|
|
|
+
|
|
|
+
|
|
|
+# Intended as a method of the class that provides extract_and_save(),
|
|
|
+# hence the `self` parameter.
|
|
|
+def enrol_from_transcript(
|
|
|
+    self,
|
|
|
+    speaker_id: str,
|
|
|
+    audio_path: Path,
|
|
|
+    transcript_segments: list[dict],  # [{speaker, start, end, text}, ...]
|
|
|
+    min_duration: float = 8.0,
|
|
|
+) -> np.ndarray:
|
|
|
+    """
|
|
|
+    Find the longest segment attributed to this speaker in a transcript
|
|
|
+    and use it for enrolment. Does not yet check for nearby speaker changes.
|
|
|
+    """
|
|
|
+    # Keep only this speaker's segments that are long enough
|
|
|
+    candidates = [
|
|
|
+        s for s in transcript_segments
|
|
|
+        if s["speaker"] == speaker_id
|
|
|
+        and (s["end"] - s["start"]) >= min_duration
|
|
|
+    ]
|
|
|
+    if not candidates:
|
|
|
+        raise ValueError(f"No clean segment >= {min_duration}s found for {speaker_id}")
|
|
|
+
|
|
|
+    # Pick longest segment
|
|
|
+    best = max(candidates, key=lambda s: s["end"] - s["start"])
|
|
|
+    return self.extract_and_save(speaker_id, audio_path, best["start"], best["end"])
|
|
|
+```
|
|
|
+
|
|
|
+This would mean after every test recording run, bridge.py logs timestamped segments, and the admin UI could show a "Use best segment for enrolment" button per speaker — fully automatic, no timestamp hunting needed.
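One way to find a "clean run with no speaker change" is to merge consecutive segments from the same speaker before picking the longest. The sketch below assumes segments shaped like `{speaker, start, end, text}` as in the helper above; `longest_run` and the `max_gap` parameter are hypothetical.

```python
from typing import Optional


def longest_run(segments: list, speaker_id: str,
                max_gap: float = 1.0) -> Optional[tuple]:
    """Return (start, end) of the longest uninterrupted run for one speaker."""
    runs = []
    current = None
    for seg in sorted(segments, key=lambda s: s["start"]):
        if seg["speaker"] != speaker_id:
            current = None  # another speaker interrupts the run
            continue
        if current is not None and seg["start"] - current["end"] <= max_gap:
            # Same speaker continues after a short pause: extend the run.
            current["end"] = max(current["end"], seg["end"])
        else:
            current = {"start": seg["start"], "end": seg["end"]}
            runs.append(current)
    if not runs:
        return None
    best = max(runs, key=lambda r: r["end"] - r["start"])
    return best["start"], best["end"]


segments = [
    {"speaker": "SPEAKER_00", "start": 0.0, "end": 4.0, "text": "Let us open"},
    {"speaker": "SPEAKER_00", "start": 4.2, "end": 9.0, "text": "in prayer"},
    {"speaker": "SPEAKER_01", "start": 9.0, "end": 12.0, "text": "Amen"},
    {"speaker": "SPEAKER_00", "start": 12.0, "end": 15.0, "text": "Please sit"},
]
best = longest_run(segments, "SPEAKER_00")
```

Merging first matters because finalised segments are often split at sentence boundaries, so a single 9-second stretch of speech may be logged as two shorter entries that would each fail a minimum-duration check.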
|
|
|
+
|
|
|
+Want me to add that to `embeddings.py` and also add the transcript segment logging to `bridge.py` so the timestamps are captured?
|
|
|
+
|
|
|
+## You said: ok, let's update them both. Let me know where I need to put the block into the bridge.
|
|
|
+
|
|
|
+ok, let's update them both. Let me know where I need to put the block into bridge.py and embeddings.py
|
|
|
+
|
|
|
+3:47 PM
|
|
|
+
|
|
|
+## Claude responded: Let me fetch the current bridge.
|
|
|
+
|
|
|
+Let me fetch the current bridge.py first so I place the changes exactly right.
|
|
|
+
|
|
|
+Failed to fetch: https://git.tazz.com.au/benjamin.harris/Deaf_Transcription_Service/src/main/bridge/bridge.py
|
|
|
+
|
|
|
+The repo is still requiring login for file access. Can you paste the current `bridge.py` content here? Specifically I need to see:
|
|
|
+
|
|
|
+1. The `push_final()` method in `BridgeState`
|
|
|
+2. The `_receive_results()` async function
|
|
|
+3. The `main()` function
|
|
|
+
|
|
|
+Those are the three places I need to insert code. While you grab those, here's the complete updated `embeddings.py` with the transcript-based enrolment added:
|
|
|
+
|
|
|
+Ran a command, created a file, read a file
|
|
|
+
|
|
|
+Good. Now for the `bridge.py` changes: here are the **four changes** you need to make. Since I can't see the current file, I'll give you find/replace targets that should be unambiguous:
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### bridge.py — Change 1: Add import at the top
|
|
|
+
|
|
|
+Find the block of imports at the top (near `import json`, `import re` etc). Add after the last import line:
|
|
|
+
|
|
|
+```python
|
|
|
+from embeddings import (
|
|
|
+    EmbeddingRegistry,
|
|
|
+    LiveEmbeddingAccumulator,
|
|
|
+    log_transcript_segment,
|
|
|
+)
|
|
|
+```
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### bridge.py — Change 2: Add registry + accumulator tracking to `BridgeState.__init__`
|
|
|
+
|
|
|
+Find:
|
|
|
+
|
|
|
+```python
|
|
|
+self._last_final_time = time.monotonic()
|
|
|
+```
|
|
|
+
|
|
|
+Add immediately after it:
|
|
|
+
|
|
|
+```python
|
|
|
+# Voiceprint matching
|
|
|
+self._embedding_registry = EmbeddingRegistry()
|
|
|
+# Per-speaker accumulators: speaker_id -> LiveEmbeddingAccumulator
|
|
|
+self._accumulators: dict[str, LiveEmbeddingAccumulator] = {}
|
|
|
+# Speaker IDs confirmed this session (skip re-matching)
|
|
|
+self._confirmed_ids: set[str] = set()
|
|
|
+# Session ID for transcript logging (date string)
|
|
|
+self._session_id: str = datetime.now(timezone.utc).strftime("%Y-%m-%d")
|
|
|
+```
|
|
|
+
|
|
|
+You'll also need to add `from datetime import datetime, timezone` to the imports if it's not already there.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### bridge.py — Change 3: Add segment logging + embedding accumulation to `push_final()`
|
|
|
+
|
|
|
+Find the line inside `push_final()` that reads something like:
|
|
|
+
|
|
|
+```python
|
|
|
+self._last_final_time = time.monotonic()
|
|
|
+```
|
|
|
+
|
|
|
+Just after that line, add:
|
|
|
+
|
|
|
+```python
|
|
|
+# ── Transcript segment logging ──────────────────────────────────────────
|
|
|
+# Log every finalised segment for later voiceprint enrolment
|
|
|
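+# start_sec/end_sec come from FrontData if available, else a rough estimate.
|
|
|
+# NOTE: this assumes `response` is in scope here; if push_final() only receives
|
|
|
+# text and speaker_id, pass the timestamps in as extra arguments instead.
|
|
|
+seg_start = getattr(response, "start", None)
|
|
|
+seg_end = getattr(response, "end", None)
|
|
|
+if seg_start is None or seg_end is None:
|
|
|
+    # _last_final_time was reset on the line above, so this elapsed time is
|
|
|
+    # close to zero; capture the previous value before the reset if the
|
|
|
+    # estimate needs to be meaningful.
|
|
|
+    seg_end = time.monotonic() - self._last_final_time
|
|
|
+    seg_start = max(0.0, seg_end - len(text.split()) * 0.4)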
|
|
|
+log_transcript_segment(
|
|
|
+ speaker_id = speaker_id or "UNKNOWN",
|
|
|
+ text = text,
|
|
|
+ start_sec = seg_start or 0.0,
|
|
|
+ end_sec = seg_end or 0.0,
|
|
|
+ session_id = self._session_id,
|
|
|
+)
|
|
|
+
|
|
|
+# ── Live embedding accumulation for auto speaker matching ───────────────
|
|
|
+if speaker_id and speaker_id not in self._confirmed_ids:
|
|
|
+    if speaker_id not in self._accumulators:
|
|
|
+        self._accumulators[speaker_id] = \
|
|
|
+            self._embedding_registry.make_accumulator(min_seconds=5.0)
|
|
|
+    # Raw PCM is not available here; the audio itself is accumulated in _send_audio.
|
|
|
+    # This only creates the accumulator; see Change 4 for the PCM-level accumulation.
|
|
|
+```
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+### bridge.py — Change 4: PCM-level accumulation in `_send_audio`
|
|
|
+
|
|
|
+This is the most important change — accumulating actual audio per speaker. Find inside `audio_processor_loop` the `_send_audio` coroutine. Find the line that calls:
|
|
|
+
|
|
|
+```python
|
|
|
+await _audio_processor.process_audio(chunk)
|
|
|
+```
|
|
|
+
|
|
|
+Replace it with:
|
|
|
+
|
|
|
+```python
|
|
|
+await _audio_processor.process_audio(chunk)
|
|
|
+
|
|
|
+# Feed chunk to the accumulator for the current speaker
|
|
|
+# _current_speaker_id is set by _receive_results as segments arrive
|
|
|
+current_spk = state.get_current_speaker_id()
|
|
|
+if current_spk and current_spk not in state._confirmed_ids:
|
|
|
+    if current_spk not in state._accumulators:
|
|
|
+        state._accumulators[current_spk] = \
|
|
|
+            state._embedding_registry.make_accumulator(min_seconds=5.0)
|
|
|
+    state._accumulators[current_spk].push(chunk)
|
|
|
+    if state._accumulators[current_spk].ready():
|
|
|
+        try:
|
|
|
+            live_emb = state._accumulators[current_spk].extract_embedding()
|
|
|
+            match = state._embedding_registry.find_match(live_emb)
|
|
|
+            if match:
|
|
|
+                matched_id, score = match
|
|
|
+                resolved_name = state.speaker_names.get(matched_id, matched_id)
|
|
|
+                print(
|
|
|
+                    f"[Embeddings] Auto-matched {current_spk} → "
|
|
|
+                    f"{resolved_name} (score={score:.3f})"
|
|
|
+                )
|
|
|
+                state.set_speaker_name(current_spk, resolved_name)
|
|
|
+                state._confirmed_ids.add(current_spk)
|
|
|
+            else:
|
|
|
+                # No match: reset accumulator to try again with more audio
|
|
|
+                state._accumulators[current_spk].reset()
|
|
|
+        except Exception as exc:
|
|
|
+            logger.warning(f"[Embeddings] Accumulator error: {exc}")
|
|
|
+```
|
|
|
+
|
|
|
+And add this method to `BridgeState`:
|
|
|
+
|
|
|
+```python
|
|
|
+def get_current_speaker_id(self) -> str | None:
|
|
|
+    """Return the raw diarization ID (SPEAKER_XX) of the current speaker."""
|
|
|
+    with self._lock:
|
|
|
+        return self._raw_speaker_id  # see below
|
|
|
+```
|
|
|
+
|
|
|
+And add `self._raw_speaker_id: str | None = None` to `__init__`, then in `push_final()` add:
|
|
|
+
|
|
|
+```python
|
|
|
+self._raw_speaker_id = speaker_id # track raw ID for accumulator
|
|
|
+```
|
|
|
+
|
|
|
+at the top of the `with self._lock:` block.
|
|
|
+
|
|
|
+---
|
|
|
+
|
|
|
+That's all four changes. The key file to copy into `bridge/` is `embeddings.py`.
|
|
|
+
|
|
|
+
|
|
|
+Here are the five precise changes:
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+### Change 1: Imports (top of file)
|
|
|
+
|
|
|
+Find:
|
|
|
+
|
|
|
+```python
|
|
|
+from collections import Counter
|
|
|
+from pathlib import Path
|
|
|
+```
|
|
|
+
|
|
|
+Replace with:
|
|
|
+
|
|
|
+```python
|
|
|
+from collections import Counter
|
|
|
+from datetime import datetime, timezone
|
|
|
+from pathlib import Path
|
|
|
+```
|
|
|
+
|
|
|
+Then find:
|
|
|
+
|
|
|
+```python
|
|
|
+from whisperlivekit import AudioProcessor, TranscriptionEngine
|
|
|
+```
|
|
|
+
|
|
|
+Replace with:
|
|
|
+
|
|
|
+```python
|
|
|
+from whisperlivekit import AudioProcessor, TranscriptionEngine
|
|
|
+from embeddings import (
|
|
|
+    EmbeddingRegistry,
|
|
|
+    log_transcript_segment,
|
|
|
+)
|
|
|
+```
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+### Change 2: `BridgeState.__init__`
|
|
|
+
|
|
|
+Find:
|
|
|
+
|
|
|
+```python
|
|
|
+        self._last_final_time = time.monotonic()
|
|
|
+```
|
|
|
+
|
|
|
+Replace with:
|
|
|
+
|
|
|
+```python
|
|
|
+        self._last_final_time = time.monotonic()
|
|
|
+        # Raw diarization ID of current speaker (SPEAKER_XX)
|
|
|
+        self._raw_speaker_id: str | None = None
|
|
|
+        # Voiceprint matching
|
|
|
+        self._embedding_registry = EmbeddingRegistry()
|
|
|
+        self._accumulators: dict[str, object] = {}
|
|
|
+        self._confirmed_ids: set[str] = set()
|
|
|
+        self._session_id: str = datetime.now(timezone.utc).strftime("%Y-%m-%d")
|
|
|
+```
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+### Change 3: `push_final()` method
|
|
|
+
|
|
|
+Find:
|
|
|
+
|
|
|
+```python
|
|
|
+    def push_final(self, text: str, speaker_id: str | None, mqtt_client: mqtt.Client) -> None:
|
|
|
+        """Accept a finalised segment; flush on sentence boundary or speaker change."""
|
|
|
+        with self._lock:
|
|
|
+            if speaker_id:
|
|
|
+                self._seen.add(speaker_id)
|
|
|
+
|
|
|
+            resolved = self._resolve(speaker_id)
|
|
|
+```
|
|
|
+
|
|
|
+Replace with:
|
|
|
+
|
|
|
+```python
|
|
|
+    def push_final(self, text: str, speaker_id: str | None, mqtt_client: mqtt.Client,
|
|
|
+                   seg_start: float = 0.0, seg_end: float = 0.0) -> None:
|
|
|
+        """Accept a finalised segment; flush on sentence boundary or speaker change."""
|
|
|
+        with self._lock:
|
|
|
+            # Track raw diarization ID for PCM accumulator
|
|
|
+            self._raw_speaker_id = speaker_id
|
|
|
+
|
|
|
+            if speaker_id:
|
|
|
+                self._seen.add(speaker_id)
|
|
|
+
|
|
|
+            # Log segment to transcript_segments.jsonl for later enrolment
|
|
|
+            log_transcript_segment(
|
|
|
+                speaker_id = speaker_id or "UNKNOWN",
|
|
|
+                text = text,
|
|
|
+                start_sec = seg_start,
|
|
|
+                end_sec = seg_end,
|
|
|
+                session_id = self._session_id,
|
|
|
+            )
|
|
|
+
|
|
|
+            resolved = self._resolve(speaker_id)
|
|
|
+```
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+### Change 4: `_receive_results()`, pass timestamps to `push_final()`
|
|
|
+
|
|
|
+Find:
|
|
|
+
|
|
|
+```python
|
|
|
+    async def _receive_results():
|
|
|
+        async for response in results_generator:
|
|
|
+            # response is a FrontData dataclass, not a dict
|
|
|
+            text = (getattr(response, "text", None) or getattr(response, "buffer_transcription", None) or "").strip()
|
|
|
+            is_final = getattr(response, "is_final", False) or getattr(response, "end_of_segment", False)
|
|
|
+            speaker = getattr(response, "speaker", None)
|
|
|
+            if is_final and text:
|
|
|
+                print(f"[Whisper] ({speaker or '?'}) {text}")
|
|
|
+                state.push_final(text, speaker, mqtt_client)
|
|
|
+```
|
|
|
+
|
|
|
+Replace with:
|
|
|
+
|
|
|
+```python
|
|
|
+    async def _receive_results():
|
|
|
+        async for response in results_generator:
|
|
|
+            text = (getattr(response, "text", None) or
|
|
|
+                    getattr(response, "buffer_transcription", None) or "").strip()
|
|
|
+            is_final = getattr(response, "is_final", False) or \
|
|
|
+                       getattr(response, "end_of_segment", False)
|
|
|
+            speaker = getattr(response, "speaker", None)
|
|
|
+            # Grab timestamps if FrontData provides them
|
|
|
+            seg_start = float(getattr(response, "start", 0.0) or 0.0)
|
|
|
+            seg_end = float(getattr(response, "end", 0.0) or 0.0)
|
|
|
+            if is_final and text:
|
|
|
+                print(f"[Whisper] ({speaker or '?'}) {text}")
|
|
|
+                state.push_final(text, speaker, mqtt_client, seg_start, seg_end)
|
|
|
+```
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+### Change 5: `_send_audio()`, PCM accumulation for live matching
|
|
|
+
|
|
|
+Find:
|
|
|
+
|
|
|
+```python
|
|
|
+        while True:
|
|
|
+            # Drain test audio injection first if available
|
|
|
+            try:
|
|
|
+                chunk = test_audio_queue.get_nowait()
|
|
|
+            except asyncio.QueueEmpty:
|
|
|
+                chunk = await audio_queue.get()
|
|
|
+            await audio_processor.process_audio(chunk)
|
|
|
+```
|
|
|
+
|
|
|
+Replace with:
|
|
|
+
|
|
|
+```python
|
|
|
+        while True:
|
|
|
+            # Drain test audio injection first if available
|
|
|
+            try:
|
|
|
+                chunk = test_audio_queue.get_nowait()
|
|
|
+            except asyncio.QueueEmpty:
|
|
|
+                chunk = await audio_queue.get()
|
|
|
+            await audio_processor.process_audio(chunk)
|
|
|
+
|
|
|
+            # Accumulate PCM for live speaker matching
|
|
|
+            current_spk = state._raw_speaker_id
|
|
|
+            if current_spk and current_spk not in state._confirmed_ids:
|
|
|
+                if current_spk not in state._accumulators:
|
|
|
+                    state._accumulators[current_spk] = \
|
|
|
+                        state._embedding_registry.make_accumulator(min_seconds=5.0)
|
|
|
+                state._accumulators[current_spk].push(chunk)
|
|
|
+                if state._accumulators[current_spk].ready():
|
|
|
+                    try:
|
|
|
+                        live_emb = state._accumulators[current_spk].extract_embedding()
|
|
|
+                        match = state._embedding_registry.find_match(live_emb)
|
|
|
+                        if match:
|
|
|
+                            matched_id, score = match
|
|
|
+                            resolved = state.speaker_names.get(matched_id, matched_id)
|
|
|
+                            print(
|
|
|
+                                f"[Embeddings] Auto-matched {current_spk} → "
|
|
|
+                                f"{resolved} (score={score:.3f})"
|
|
|
+                            )
|
|
|
+                            with state._lock:
|
|
|
+                                state.speaker_names[current_spk] = resolved
|
|
|
+                                state._confirmed_ids.add(current_spk)
|
|
|
+                            _write_speakers(state.speaker_names)
|
|
|
+                        else:
|
|
|
+                            # No match yet: reset and try again with more audio
|
|
|
+                            state._accumulators[current_spk].reset()
|
|
|
+                    except Exception as exc:
|
|
|
+                        print(f"[Embeddings] Accumulator error: {exc}")
|
|
|
+                        state._accumulators[current_spk].reset()
|
|
|
+```
|
|
|
+
|
|
|
+
|
|
|
+
|
|
|
+That's all five changes. One caveat: `audio_callback` currently produces float32 audio, but the accumulator expects int16. `LiveEmbeddingAccumulator.push()` and `extract_embedding()` write a WAV file assuming int16, and Change 5 pushes the chunk unchanged, so convert it first. Replace the `state._accumulators[current_spk].push(chunk)` line with the following (this needs `import numpy as np` in bridge.py if it is not already imported):
|
|
|
+
|
|
|
+```python
|
|
|
+                # Convert float32 → int16 for the embedding accumulator (clip to avoid overflow)
|
|
|
+                samples = np.clip(np.frombuffer(chunk, dtype=np.float32), -1.0, 1.0)
|
|
|
+                chunk_i16 = (samples * 32767).astype(np.int16).tobytes()
|
|
|
+                state._accumulators[current_spk].push(chunk_i16)
|
|
|
+```
|
|
|
+
|
|
|
+Once those are in, drop embeddings.py into the bridge/ folder alongside bridge.py and it will be importable. The transcript log will start filling up from the first run, ready for auto-enrol once you have a recording to pull segments from.
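For reference, the transcript log can be as simple as one JSON object per line. The sketch below mirrors the keyword arguments used in the `push_final()` call and the `transcript_segments.jsonl` file name from the code comment; the record keys, the `log_path` parameter and the `read_transcript_segments` reader are assumptions, not the actual `embeddings.py` implementation.

```python
import json
from pathlib import Path
from typing import Optional

# Assumed default location for the segment log.
LOG_PATH = Path("transcript_segments.jsonl")


def log_transcript_segment(speaker_id: str, text: str, start_sec: float,
                           end_sec: float, session_id: str,
                           log_path: Path = LOG_PATH) -> None:
    """Append one finalised segment to the log as a single JSON line."""
    record = {
        "session_id": session_id,
        "speaker": speaker_id,
        "start": float(start_sec),
        "end": float(end_sec),
        "text": text,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_transcript_segments(log_path: Path = LOG_PATH,
                             session_id: Optional[str] = None) -> list:
    """Load logged segments, optionally filtered to one session."""
    if not Path(log_path).exists():
        return []
    segments = []
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            record = json.loads(line)
            if session_id is None or record.get("session_id") == session_id:
                segments.append(record)
    return segments
```

An append-only JSONL file survives a crash mid-service with at most one partial line lost, and the records already have the `{speaker, start, end, text}` shape that the transcript-based enrolment helper expects.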
|
|
|
+
|
|
|
+What needs to be added:
|
|
|
+
|
|
|
+- A **Voiceprint** column to the speaker table with status badge
|
|
|
+- A **voiceprint modal** per speaker with three tabs: auto-enrol from transcript, manual segment picker, and similarity test
|
|
|
+- New API endpoints for all embedding operations
|
|
|
+- A **Voiceprints** summary card showing all enrolled speakers
|