Benjamin Harris 1 month ago
parent
commit
980e1df655
1 changed file with 626 additions and 0 deletions
Embedding.md

@@ -0,0 +1,626 @@
+The README shows voice enrolment v2 is the next planned feature, and per-speaker voice sample upload already exists in admin.py. That's the right foundation to build on.
+
+---
+
+### Voiceprint / Speaker Embedding Design
+
+**To answer your core question first:** No, don't store the embedding in `speakers.json`. A speaker embedding is a float32 numpy array of 512 dimensions — about 2KB of binary data. JSON can technically hold it as a list of floats but it becomes unwieldy and slow to load. The clean approach is:
+
+```
+bridge/
+├── speakers.json          ← names, metadata (text only, fast to load)
+└── embeddings/
+    ├── SPEAKER_00.npy     ← numpy binary, one file per speaker
+    ├── SPEAKER_01.npy
+    └── SPEAKER_02.npy
+```
+
+When you later move to a remote DB, the `.npy` files become `BLOB` columns or vector columns in something like pgvector. The `speakers.json` structure stays identical; you just add a `"has_embedding": true` flag.
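
As a rough check on the size argument above, the sketch below packs a made-up 512-value vector as float32 and compares it with the same numbers written as a JSON list. It uses only the standard library (`array`, `json`) instead of numpy, and the values are invented for illustration; a real `.npy` file would add a small header on top of the raw bytes.

```python
import json
import random
from array import array

DIMS = 512  # embedding size quoted above

# Made-up values standing in for a real embedding
random.seed(0)
values = [random.uniform(-1.0, 1.0) for _ in range(DIMS)]

# Binary form: 4 bytes per float32 value
binary = array("f", values).tobytes()

# Text form: the same numbers serialised as a JSON list
text = json.dumps(values)

binary_size = len(binary)
json_size = len(text.encode("utf-8"))
print(f"float32 binary: {binary_size} bytes")
print(f"JSON list:      {json_size} bytes")
```

The binary form is exactly 512 × 4 bytes, while the JSON text is several times larger and has to be parsed number by number on every load.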
+
+---
+
+### Extracting a Voiceprint From a 10-Second Snip
+
+This is the right approach — far better than asking someone to "say a sentence". Here's the full picture:
+
+**The tool:** `pyannote.audio`'s `SpeakerEmbedding` pipeline — already in your venv as a dependency of diart. It takes an audio segment and returns a 512-dimensional embedding vector that acts as a voice fingerprint.
+
+**The process for extracting from an existing recording:**
+
+1. Operator opens admin.py, navigates to a speaker
+2. Selects a timestamp range from a test recording (e.g. "00:45 → 00:55")
+3. Admin server extracts that 10s segment using miniaudio or ffmpeg
+4. Passes it to `SpeakerEmbedding` pipeline
+5. Saves the resulting `.npy` file to `embeddings/SPEAKER_XX.npy`
+
+**At runtime during a service:**
+
+When diart assigns a `SPEAKER_XX` label, bridge.py checks whether a stored embedding exists for any speaker ID that is still unmatched. If the cosine similarity between the current speaker's embedding (accumulated over ~5 seconds of live speech) and a stored embedding is above the threshold (~0.82), it auto-assigns the name without operator input.
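
The matching step can be sketched in a few lines. This is a minimal pure-Python illustration, not the module itself: the `enrolled` dict argument and the toy 3-dimensional vectors are assumptions made for the example, and the real code would work on 512-value numpy arrays loaded from the `.npy` files.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero embedding matches nothing
    return dot / (norm_a * norm_b)

def find_best_match(live, enrolled, threshold=0.82):
    """Return (speaker_id, score) for the closest enrolled voice, or None."""
    best = None
    for speaker_id, stored in enrolled.items():
        score = cosine_similarity(live, stored)
        if best is None or score > best[1]:
            best = (speaker_id, score)
    if best is not None and best[1] >= threshold:
        return best
    return None

# Toy vectors; real embeddings would have 512 values
enrolled = {
    "SPEAKER_00": [1.0, 0.0, 0.0],
    "SPEAKER_01": [0.0, 1.0, 0.0],
}
match = find_best_match([0.9, 0.1, 0.0], enrolled)     # close to SPEAKER_00
no_match = find_best_match([0.5, 0.5, 0.5], enrolled)  # below threshold
```

Returning `None` below the threshold is what lets the bridge fall back to prompting the operator instead of guessing.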
+
+---
+
+### The `speakers.json` Schema Change
+
+Extend it from the current simple `{id: name}` to:
+
+```json
+{
+  "SPEAKER_00": {
+    "name": "Pastor John",
+    "has_embedding": true,
+    "embedding_updated": "2026-05-04T09:32:00",
+    "colour": "#2563eb",
+    "notes": "Senior pastor"
+  },
+  "SPEAKER_01": {
+    "name": "Mary (Reader)",
+    "has_embedding": false,
+    "embedding_updated": null,
+    "colour": "#16a34a",
+    "notes": ""
+  }
+}
+```
+
+The bridge reads `has_embedding` to decide whether to attempt auto-matching. The actual vector lives in `embeddings/SPEAKER_XX.npy`.
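
To keep old files readable, a loader could upgrade flat `{id: name}` entries on the fly. The sketch below is one way that might look; the function names and the empty-string defaults for `colour` and `notes` are assumptions for illustration, not taken from the actual module.

```python
def normalise_speaker_entry(value):
    """Upgrade a legacy name string to the richer record; pass dicts through."""
    defaults = {
        "name": "",
        "has_embedding": False,
        "embedding_updated": None,
        "colour": "",   # assumed default
        "notes": "",
    }
    if isinstance(value, str):
        # Old format: {"SPEAKER_00": "Pastor John"}
        return {**defaults, "name": value}
    # New format: fill in any fields that are missing
    return {**defaults, **value}

def has_voiceprint(speakers, speaker_id):
    """True only when the entry exists and is flagged as having an embedding."""
    entry = speakers.get(speaker_id)
    if entry is None:
        return False
    return bool(normalise_speaker_entry(entry)["has_embedding"])

speakers = {
    "SPEAKER_00": {"name": "Pastor John", "has_embedding": True},
    "SPEAKER_01": "Mary (Reader)",  # legacy flat entry
}
upgraded = normalise_speaker_entry(speakers["SPEAKER_01"])
```

With this shape, the bridge only needs `has_voiceprint()` to decide whether a stored profile is worth comparing against.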
+
+---
+
+### Implementation Plan
+
+This is a 3-part addition:
+
+**Part 1 —** `embeddings.py` (new file, ~80 lines) A small module that handles extraction, storage and matching:
+
+```python
+# Core interface
+extract_embedding(audio_path, start_sec, end_sec) -> np.ndarray
+save_embedding(speaker_id, embedding)
+load_embedding(speaker_id) -> np.ndarray | None
+find_best_match(live_embedding, threshold=0.82) -> tuple[str, float] | None
+```
+
+**Part 2 — Admin UI additions** On each speaker row, add an "Extract from recording" button that lets the operator:
+
+- Pick an uploaded test recording from a dropdown
+- Set a start/end time (or use a simple "pick a 10s window" slider)
+- Click "Extract" — server processes it and saves the `.npy`
+- Row updates to show "✓ Voiceprint saved"
+
+**Part 3 — Bridge auto-matching** In `push_final()`, when a new unknown speaker appears, compare their accumulated audio embedding against all stored profiles before prompting the operator. If a match is found above threshold, auto-assign the name silently.
+
+---
+
+Want me to write `embeddings.py` first, then the admin UI changes, then the bridge matching logic? The `embeddings.py` module is the cleanest starting point — it's self-contained and we can test it standalone before wiring it into the UI.
+
+## You said:
+
+OK, let's get `embeddings.py` under way.
+
+3:30 PM
+
+## Claude responded:
+
+Let me check the current bridge structure first before writing it.
+
+Searched the web
+
+The repo is requiring login for raw file access now. No matter — I have enough context from the session and the README. Let me write the module.
+
+Viewed a file, created a file, read a file
+
+Here's what's in it and why each piece is designed the way it is.
+
+**Storage** — `.npy` files in `bridge/embeddings/`, one per speaker. `speakers.json` gains `has_embedding`, `embedding_updated`, `colour`, and `notes` fields but the schema is backward compatible — existing flat `{"SPEAKER_00": "Pastor John"}` entries are migrated transparently via `_normalise_speaker_entry()`.
+
+**Extraction** — `extract()` uses miniaudio (already in your venv) to pull a segment by start/end seconds, writes a temp WAV, feeds it to pyannote's `SpeakerEmbedding` pipeline, then deletes the temp file. The pipeline is lazy-loaded on first call so import is fast.
+
+`LiveEmbeddingAccumulator` — this is the piece that plugs into bridge.py. When diart announces a new `SPEAKER_XX`, bridge creates an accumulator, feeds it PCM chunks via `push()`, and once `ready()` returns True (5 seconds by default) calls `extract_embedding()` + `registry.find_match()`. If it matches an enrolled speaker, the name is auto-assigned without operator prompt.
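
The buffering half of that accumulator is simple enough to sketch on its own. The class below is a hypothetical stand-in, not the real `LiveEmbeddingAccumulator`: it assumes 16 kHz, 16-bit mono PCM and leaves out the embedding extraction entirely.

```python
class PcmAccumulator:
    """Collects raw PCM bytes until a minimum duration is buffered."""

    def __init__(self, min_seconds=5.0, sample_rate=16000, bytes_per_sample=2):
        self.min_seconds = min_seconds
        # Bytes of mono audio per second (assumed 16 kHz, 16-bit)
        self.bytes_per_second = sample_rate * bytes_per_sample
        self._buffer = bytearray()

    def push(self, chunk: bytes) -> None:
        """Append one chunk of raw audio."""
        self._buffer.extend(chunk)

    def seconds(self) -> float:
        """Duration of audio buffered so far."""
        return len(self._buffer) / self.bytes_per_second

    def ready(self) -> bool:
        """True once enough audio is buffered to attempt a match."""
        return self.seconds() >= self.min_seconds

    def reset(self) -> None:
        """Discard the buffer, e.g. after a failed match."""
        self._buffer.clear()

acc = PcmAccumulator(min_seconds=5.0)
one_second = bytes(32000)  # 1 s of silence at 16 kHz, 16-bit mono
for _ in range(4):
    acc.push(one_second)
ready_after_4s = acc.ready()
acc.push(one_second)
ready_after_5s = acc.ready()
```

Measuring readiness in seconds rather than chunk count keeps the behaviour the same whatever chunk size the audio callback delivers.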
+
+**CLI** — the file is executable standalone for testing without starting the full stack:
+
+```bash
+# Enrol from seconds 45–55 of last Sunday's recording
+python bridge/embeddings.py enrol SPEAKER_00 test_recordings/service.mp3 --start 45 --end 55
+
+# Check how the same segment scores against all enrolled speakers
+python bridge/embeddings.py scores test_recordings/service.mp3 --start 45 --end 55
+
+# List who's enrolled
+python bridge/embeddings.py list
+```
+
+Next steps are the admin UI additions (a segment picker per speaker row) and wiring `LiveEmbeddingAccumulator` into `bridge.py`'s `push_final()` path.
+
+
+### What a Voiceprint Actually Is
+
+The embedding captures **acoustic characteristics** of a voice — things like fundamental frequency, formant patterns, speaking rhythm, vocal tract shape. It has nothing to do with the words spoken. So a person saying "the quick brown fox" and saying "let us pray" would produce very similar embeddings, because it's the _voice_ not the _words_ being fingerprinted.
+
+So transcription accuracy doesn't directly improve the embedding quality.
+
+---
+
+### What _Does_ Improve Embedding Quality
+
+The factors that matter are:
+
+| Factor | Impact |
+| --- | --- |
+| **Audio cleanliness** | Highest — noise, reverb, music bleed all degrade it |
+| **Segment length** | 10–20s is the sweet spot; diminishing returns after 30s |
+| **Single speaker** | Must be isolated — no overlapping voices in the segment |
+| **Consistent mic distance** | Close-mic (pulpit mic) is better than room mic |
+| **Multiple samples averaged** | Enrolling 2–3 separate clips and averaging the embeddings is more robust than one clip |
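
Averaging several clips, the last row in the table, could look roughly like this. It is a pure-Python sketch with toy 2-dimensional vectors; normalising each clip to unit length before averaging is an assumption of this example (it stops a louder clip from dominating), not something the module is known to do.

```python
import math

def l2_normalise(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalise a zero vector")
    return [x / norm for x in vec]

def average_embeddings(embeddings):
    """Mean of unit-length embeddings, re-normalised to unit length."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dims = len(embeddings[0])
    if any(len(e) != dims for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    unit = [l2_normalise(e) for e in embeddings]
    # Column-wise mean across all clips
    mean = [sum(col) / len(unit) for col in zip(*unit)]
    return l2_normalise(mean)

# Three toy "clips" of the same speaker
clips = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
profile = average_embeddings(clips)
```

The averaged profile is what would be saved as the speaker's single `.npy` file, so runtime matching stays one comparison per enrolled speaker.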
+
+---
+
+### Where Transcription _Does_ Help — Indirectly
+
+Here's the indirect benefit you're probably intuiting:
+
+**If you know what was said, you can verify the segment is clean.** For example:
+
+- If the transcript shows `[SPEAKER_00] "Let us open in prayer..."` — you know that's isolated speech, no congregation response, no overlap. That's a good enrolment segment.
+- If the transcript shows `[SPEAKER_00] "[inaudible] ...mmm... [SPEAKER_01]"` — overlapping speakers, bad segment, skip it.
+
+So the workflow becomes:
+
+```
+1. Run a test recording through the pipeline
+2. Read the transcript output to find clean, isolated segments per speaker
+3. Note the timestamps of those clean segments
+4. Use embeddings.py enrol with those specific timestamps
+```
+
+This is actually a **much better enrolment strategy** than asking someone to record a sample — because the transcript lets you cherry-pick the cleanest real-world speech from an actual service.
+
+---
+
+### Practical Addition to embeddings.py
+
+We could add an `enrol_from_transcript()` helper that takes the transcript log (which bridge.py already produces) and automatically picks the best segment per speaker: the longest clean run with no speaker change within it.
+
+```python
+from pathlib import Path
+
+import numpy as np
+
+
+# Method sketch: `self` is the object that provides extract_and_save()
+def enrol_from_transcript(
+    self,
+    speaker_id: str,
+    audio_path: Path,
+    transcript_segments: list[dict],  # [{speaker, start, end, text}, ...]
+    min_duration: float = 8.0,
+) -> np.ndarray:
+    """
+    Find the longest single-speaker segment of at least min_duration
+    seconds in a transcript and use it for enrolment.
+    """
+    # Keep only this speaker's segments that are long enough
+    candidates = [
+        s for s in transcript_segments
+        if s["speaker"] == speaker_id
+        and (s["end"] - s["start"]) >= min_duration
+    ]
+    if not candidates:
+        raise ValueError(f"No clean segment >= {min_duration}s found for {speaker_id}")
+
+    # Pick longest segment
+    best = max(candidates, key=lambda s: s["end"] - s["start"])
+    return self.extract_and_save(speaker_id, audio_path, best["start"], best["end"])
+
+This would mean after every test recording run, bridge.py logs timestamped segments, and the admin UI could show a "Use best segment for enrolment" button per speaker — fully automatic, no timestamp hunting needed.
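
For the logging side, one JSON object per line is probably the simplest format to append to and read back. The sketch below is a guess at how that could work: the function names, the explicit `log_path` argument and the record keys are assumptions, chosen so the records match the `{speaker, start, end, text}` shape used above.

```python
import json
import tempfile
from pathlib import Path

def log_segment(log_path, speaker_id, text, start_sec, end_sec, session_id):
    """Append one finalised segment as a single JSON line."""
    record = {
        "session": session_id,
        "speaker": speaker_id,
        "start": start_sec,
        "end": end_sec,
        "text": text,
    }
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")

def load_segments(log_path, session_id):
    """Read back the segments of one session, skipping blank lines."""
    segments = []
    with open(log_path, encoding="utf-8") as fh:
        for line in fh:
            if line.strip():
                record = json.loads(line)
                if record["session"] == session_id:
                    segments.append(record)
    return segments

# Round trip in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "transcript_segments.jsonl"
    log_segment(log_path, "SPEAKER_00", "Let us open in prayer", 45.0, 55.5, "2026-05-04")
    log_segment(log_path, "SPEAKER_01", "Amen", 56.0, 57.0, "2026-05-04")
    segments = load_segments(log_path, "2026-05-04")

longest = max(segments, key=lambda s: s["end"] - s["start"])
```

Appending line by line means a crash mid-service loses at most the segment being written, and the admin UI can filter by session without loading a whole JSON document.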
+
+Want me to add that to `embeddings.py` and also add the transcript segment logging to `bridge.py` so the timestamps are captured?
+
+## You said:
+
+OK, let's update them both. Let me know where I need to put the blocks in `bridge.py` and `embeddings.py`.
+
+3:47 PM
+
+## Claude responded:
+
+Let me fetch the current `bridge.py` first so I place the changes exactly right.
+
+Failed to fetch: https://git.tazz.com.au/benjamin.harris/Deaf_Transcription_Service/src/main/bridge/bridge.py
+
+The repo is still requiring login for file access. Can you paste the current `bridge.py` content here? Specifically I need to see:
+
+1. The `push_final()` method in `BridgeState`
+2. The `_receive_results()` async function
+3. The `main()` function
+
+Those are the three places I need to insert code. While you grab those, here's the complete updated `embeddings.py` with the transcript-based enrolment added:
+
+Ran a command, created a file, read a file
+
+Good. Now for the `bridge.py` changes: here are the **four changes** you need to make. Since I can't see the current file, I'll give you find/replace targets that should be unambiguous:
+
+---
+
+### bridge.py — Change 1: Add import at the top
+
+Find the block of imports at the top (near `import json`, `import re` etc). Add after the last import line:
+
+```python
+from embeddings import (
+    EmbeddingRegistry,
+    LiveEmbeddingAccumulator,
+    log_transcript_segment,
+)
+```
+
+---
+
+### bridge.py — Change 2: Add registry + accumulator tracking to `BridgeState.__init__`
+
+Find:
+
+```python
+self._last_final_time = time.monotonic()
+```
+
+Add immediately after it:
+
+```python
+# Voiceprint matching
+self._embedding_registry = EmbeddingRegistry()
+# Per-speaker accumulators: speaker_id -> LiveEmbeddingAccumulator
+self._accumulators: dict[str, LiveEmbeddingAccumulator] = {}
+# Speaker IDs confirmed this session (skip re-matching)
+self._confirmed_ids: set[str] = set()
+# Session ID for transcript logging (date string)
+self._session_id: str = datetime.now(timezone.utc).strftime("%Y-%m-%d")
+```
+
+You'll also need to add `from datetime import datetime, timezone` to the imports if it's not already there.
+
+---
+
+### bridge.py — Change 3: Add segment logging + embedding accumulation to `push_final()`
+
+Find the line inside `push_final()` that reads something like:
+
+```python
+self._last_final_time = time.monotonic()
+```
+
+Just after that line, add:
+
+```python
+# ── Transcript segment logging ──────────────────────────────────────────
+# Log every finalised segment for later voiceprint enrolment
+# start_sec/end_sec from FrontData if available, else estimate from time
+seg_start = getattr(response, "start", None)
+seg_end   = getattr(response, "end",   None)
+if seg_start is None:
+    seg_end   = time.monotonic() - self._last_final_time
+    seg_start = max(0.0, seg_end - len(text.split()) * 0.4)
+log_transcript_segment(
+    speaker_id = speaker_id or "UNKNOWN",
+    text       = text,
+    start_sec  = seg_start or 0.0,
+    end_sec    = seg_end   or 0.0,
+    session_id = self._session_id,
+)
+
+# ── Live embedding accumulation for auto speaker matching ───────────────
+if speaker_id and speaker_id not in self._confirmed_ids:
+    if speaker_id not in self._accumulators:
+        self._accumulators[speaker_id] = \
+            self._embedding_registry.make_accumulator(min_seconds=5.0)
+    # We don't have raw PCM here — accumulation happens in _send_audio
+    # This is a placeholder; see Change 4 for the PCM-level accumulation
+```
+
+---
+
+### bridge.py — Change 4: PCM-level accumulation in `_send_audio`
+
+This is the most important change — accumulating actual audio per speaker. Find inside `audio_processor_loop` the `_send_audio` coroutine. Find the line that calls:
+
+```python
+await _audio_processor.process_audio(chunk)
+```
+
+Replace it with:
+
+```python
+await _audio_processor.process_audio(chunk)
+
+# Feed chunk to the accumulator for the current speaker
+# _current_speaker_id is set by _receive_results as segments arrive
+current_spk = state.get_current_speaker_id()
+if current_spk and current_spk not in state._confirmed_ids:
+    if current_spk not in state._accumulators:
+        state._accumulators[current_spk] = \
+            state._embedding_registry.make_accumulator(min_seconds=5.0)
+    state._accumulators[current_spk].push(chunk)
+    if state._accumulators[current_spk].ready():
+        try:
+            live_emb = state._accumulators[current_spk].extract_embedding()
+            match    = state._embedding_registry.find_match(live_emb)
+            if match:
+                matched_id, score = match
+                resolved_name = state.speaker_names.get(matched_id, matched_id)
+                print(
+                    f"[Embeddings] Auto-matched {current_spk} → "
+                    f"{resolved_name} (score={score:.3f})"
+                )
+                state.set_speaker_name(current_spk, resolved_name)
+                state._confirmed_ids.add(current_spk)
+            else:
+                # No match — reset accumulator to try again with more audio
+                state._accumulators[current_spk].reset()
+        except Exception as exc:
+            logger.warning(f"[Embeddings] Accumulator error: {exc}")
+```
+
+And add this method to `BridgeState`:
+
+```python
+def get_current_speaker_id(self) -> str | None:
+    """Return the raw diarization ID (SPEAKER_XX) of the current speaker."""
+    with self._lock:
+        return self._raw_speaker_id  # see below
+```
+
+And add `self._raw_speaker_id: str | None = None` to `__init__`, then in `push_final()` add:
+
+```python
+self._raw_speaker_id = speaker_id  # track raw ID for accumulator
+```
+
+at the top of the `with self._lock:` block.
+
+---
+
+That's all four changes. The key file to copy into `bridge/` is `embeddings.py`.
+
+
+Here are the five precise changes:
+
+
+
+### Change 1 — Imports (top of file)
+
+Find:
+
+```python
+from collections import Counter
+from pathlib import Path
+```
+
+Replace with:
+
+```python
+from collections import Counter
+from datetime import datetime, timezone
+from pathlib import Path
+```
+
+Then find:
+
+```python
+from whisperlivekit import AudioProcessor, TranscriptionEngine
+```
+
+Replace with:
+
+```python
+from whisperlivekit import AudioProcessor, TranscriptionEngine
+from embeddings import (
+    EmbeddingRegistry,
+    log_transcript_segment,
+)
+```
+
+
+
+### Change 2 — BridgeState.__init__
+
+Find:
+
+```python
+        self._last_final_time              = time.monotonic()
+```
+
+Replace with:
+
+```python
+        self._last_final_time              = time.monotonic()
+        # Raw diarization ID of current speaker (SPEAKER_XX)
+        self._raw_speaker_id: str | None   = None
+        # Voiceprint matching
+        self._embedding_registry           = EmbeddingRegistry()
+        self._accumulators: dict[str, object] = {}
+        self._confirmed_ids: set[str]      = set()
+        self._session_id: str              = datetime.now(timezone.utc).strftime("%Y-%m-%d")
+```
+
+
+
+### Change 3 — push_final() method
+
+Find:
+
+```python
+    def push_final(self, text: str, speaker_id: str | None, mqtt_client: mqtt.Client) -> None:
+        """Accept a finalised segment; flush on sentence boundary or speaker change."""
+        with self._lock:
+            if speaker_id:
+                self._seen.add(speaker_id)
+
+            resolved = self._resolve(speaker_id)
+```
+
+Replace with:
+
+```python
+    def push_final(self, text: str, speaker_id: str | None, mqtt_client: mqtt.Client,
+                   seg_start: float = 0.0, seg_end: float = 0.0) -> None:
+        """Accept a finalised segment; flush on sentence boundary or speaker change."""
+        with self._lock:
+            # Track raw diarization ID for PCM accumulator
+            self._raw_speaker_id = speaker_id
+
+            if speaker_id:
+                self._seen.add(speaker_id)
+
+            # Log segment to transcript_segments.jsonl for later enrolment
+            log_transcript_segment(
+                speaker_id = speaker_id or "UNKNOWN",
+                text       = text,
+                start_sec  = seg_start,
+                end_sec    = seg_end,
+                session_id = self._session_id,
+            )
+
+            resolved = self._resolve(speaker_id)
+```
+
+
+
+### Change 4 — _receive_results() — pass timestamps to push_final
+
+Find:
+
+```python
+    async def _receive_results():
+        async for response in results_generator:
+            # response is a FrontData dataclass, not a dict
+            text     = (getattr(response, "text", None) or getattr(response, "buffer_transcription", None) or "").strip()
+            is_final = getattr(response, "is_final", False) or getattr(response, "end_of_segment", False)
+            speaker  = getattr(response, "speaker", None)
+            if is_final and text:
+                print(f"[Whisper] ({speaker or '?'}) {text}")
+                state.push_final(text, speaker, mqtt_client)
+```
+
+Replace with:
+
+```python
+    async def _receive_results():
+        async for response in results_generator:
+            text     = (getattr(response, "text", None) or
+                        getattr(response, "buffer_transcription", None) or "").strip()
+            is_final = getattr(response, "is_final", False) or \
+                       getattr(response, "end_of_segment", False)
+            speaker  = getattr(response, "speaker", None)
+            # Grab timestamps if FrontData provides them
+            seg_start = float(getattr(response, "start", 0.0) or 0.0)
+            seg_end   = float(getattr(response, "end",   0.0) or 0.0)
+            if is_final and text:
+                print(f"[Whisper] ({speaker or '?'}) {text}")
+                state.push_final(text, speaker, mqtt_client, seg_start, seg_end)
+```
+
+
+
+### Change 5 — _send_audio() — PCM accumulation for live matching
+
+Find:
+
+```python
+            while True:
+                # Drain test audio injection first if available
+                try:
+                    chunk = test_audio_queue.get_nowait()
+                except asyncio.QueueEmpty:
+                    chunk = await audio_queue.get()
+                await audio_processor.process_audio(chunk)
+```
+
+Replace with:
+
+```python
+            while True:
+                # Drain test audio injection first if available
+                try:
+                    chunk = test_audio_queue.get_nowait()
+                except asyncio.QueueEmpty:
+                    chunk = await audio_queue.get()
+                await audio_processor.process_audio(chunk)
+
+                # Accumulate PCM for live speaker matching
+                current_spk = state._raw_speaker_id
+                if current_spk and current_spk not in state._confirmed_ids:
+                    if current_spk not in state._accumulators:
+                        state._accumulators[current_spk] = \
+                            state._embedding_registry.make_accumulator(min_seconds=5.0)
+                    state._accumulators[current_spk].push(chunk)
+                    if state._accumulators[current_spk].ready():
+                        try:
+                            live_emb = state._accumulators[current_spk].extract_embedding()
+                            match    = state._embedding_registry.find_match(live_emb)
+                            if match:
+                                matched_id, score = match
+                                resolved = state.speaker_names.get(matched_id, matched_id)
+                                print(
+                                    f"[Embeddings] Auto-matched {current_spk} → "
+                                    f"{resolved} (score={score:.3f})"
+                                )
+                                with state._lock:
+                                    state.speaker_names[current_spk] = resolved
+                                    state._confirmed_ids.add(current_spk)
+                                    _write_speakers(state.speaker_names)
+                            else:
+                                # No match yet — reset and try again with more audio
+                                state._accumulators[current_spk].reset()
+                        except Exception as exc:
+                            print(f"[Embeddings] Accumulator error: {exc}")
+                            state._accumulators[current_spk].reset()
+```
+
+
+
+That's all five changes. One caveat: `audio_callback` currently produces float32 samples, but `LiveEmbeddingAccumulator.push()` and `extract_embedding()` write a WAV file assuming int16, and Change 5 pushes each chunk unconverted. Convert the chunk to int16 first by replacing the `state._accumulators[current_spk].push(chunk)` line with:
+
+```python
+                    # Convert float32 → int16 for the embedding accumulator
+                    # (clip first so samples outside [-1, 1] cannot wrap around)
+                    samples   = np.clip(np.frombuffer(chunk, dtype=np.float32), -1.0, 1.0)
+                    chunk_i16 = (samples * 32767).astype(np.int16).tobytes()
+                    state._accumulators[current_spk].push(chunk_i16)
+```
+
+This assumes `numpy` is already imported as `np` in `bridge.py`.
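
If it helps to see the conversion in isolation, here is a standard-library sketch of the same idea. It is illustrative only: it assumes native-endian float32 input and uses the `array` module rather than numpy, and the clipping step is there because a sample slightly outside [-1, 1] would otherwise not fit in int16.

```python
from array import array

def float32_to_int16(chunk: bytes) -> bytes:
    """Convert native-endian float32 PCM in [-1, 1] to int16 PCM bytes."""
    samples = array("f")
    samples.frombytes(chunk)
    out = array("h")
    for s in samples:
        clipped = max(-1.0, min(1.0, s))  # keep the value inside int16 range
        out.append(int(clipped * 32767))  # truncates toward zero
    return out.tobytes()

# Five test samples, including one that is too loud (1.5)
loud = array("f", [0.0, 0.5, 1.0, -1.0, 1.5]).tobytes()
converted = array("h")
converted.frombytes(float32_to_int16(loud))
print(list(converted))
```

The output is half the size of the input, since each 4-byte float becomes a 2-byte integer, which also matters for any duration arithmetic done on buffered bytes.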
+
+Once those are in, drop embeddings.py into the bridge/ folder alongside bridge.py and it will be importable. The transcript log will start filling up from the first run, ready for auto-enrol once you have a recording to pull segments from.
+
+What needs to be added:
+
+- A **Voiceprint** column to the speaker table with status badge
+- A **voiceprint modal** per speaker with three tabs: auto-enrol from transcript, manual segment picker, and similarity test
+- New API endpoints for all embedding operations
+- A **Voiceprints** summary card showing all enrolled speakers