Speaker Fuzzy Fingerprints: Benchmarking Text-Based Identification in Multiparty Dialogues

Rui Ribeiro, Luísa Coheur, Joao P. Carvalho

TL;DR

This work tackles text-based speaker identification in multiparty dialogues using fuzzy fingerprints derived from large pre-trained models, augmented with speaker-specific tokens and context-aware input processing. It shows that incorporating conversational context significantly boosts accuracy, reaching 70.6% on Friends and 67.7% on Big Bang Theory, while fuzzy fingerprints come close to full fine-tuning performance with fewer hidden units and improved interpretability. The study also provides a mechanism to detect speaker-agnostic or generic utterances and highlights the remaining challenges posed by short or non-distinctive lines. Overall, the findings point to directions for improving context modeling and disambiguation in multi-speaker dialogues.

Abstract

Speaker identification using voice recordings leverages unique acoustic features, but this approach fails when only textual data is available. Few approaches have attempted to tackle the problem of identifying speakers solely from text, and the existing ones have primarily relied on traditional methods. In this work, we explore the use of fuzzy fingerprints from large pre-trained models to improve text-based speaker identification. We integrate speaker-specific tokens and context-aware modeling, demonstrating that conversational context significantly boosts accuracy, reaching 70.6% on the Friends dataset and 67.7% on the Big Bang Theory dataset. Additionally, we show that fuzzy fingerprints can approximate full fine-tuning performance with fewer hidden units, offering improved interpretability. Finally, we analyze ambiguous utterances and propose a mechanism to detect speaker-agnostic lines. Our findings highlight key challenges and provide insights for future improvements in text-based speaker identification.

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Distribution of Turns per Speaker for the Friends Corpus (O - Other; RO - Ross; RA - Rachel; J - Joey; C - Chandler; P - Phoebe; M - Monica).
  • Figure 2: Turn distribution among characters in the Big Bang Theory dataset (S - Sheldon; L - Leonard; P - Penny; O - Other; H - Howard; R - Raj; A - Amy; B - Bernadette).
  • Figure 3: Accuracy variation with different fuzzy fingerprint sizes on the Friends and Big Bang Theory datasets. The Fuzzy Fingerprint (FFP) model (solid lines) retains only the top-$k$ hidden units from the last hidden layer, while RoBERTa (dashed lines) utilizes all 768 hidden units.
  • Figure 4: Histogram of utterance lengths comparing correct and incorrect predictions. Darker red areas indicate regions where correct and incorrect samples overlap.
  • Figure 5: Confusion matrix for the Friends dataset using a fuzzy fingerprint of size $k=409$.
  • ...and 1 more figure
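
The Figure 3 caption says that the fuzzy fingerprint model retains only the top-$k$ hidden units from the last hidden layer. A minimal sketch of that idea follows, under stated assumptions: the function names (`top_k_fingerprint`, `similarity`, `identify`), the ranking by raw activation value, the linear rank-based membership, and the min-overlap similarity are all illustrative choices made here and are not taken from the paper. Toy 6-dimensional vectors stand in for the 768 hidden units.

```python
from typing import Dict, List


def top_k_fingerprint(activations: List[float], k: int) -> Dict[int, float]:
    """Keep the k most active hidden units, each with a rank-based membership.

    Assumption: units are ranked by raw activation, and membership falls
    linearly from 1.0 (rank 0) to 1/k (rank k-1).
    """
    ranked = sorted(
        range(len(activations)), key=lambda i: activations[i], reverse=True
    )[:k]
    return {unit: 1.0 - rank / k for rank, unit in enumerate(ranked)}


def similarity(fp_a: Dict[int, float], fp_b: Dict[int, float]) -> float:
    """Overlap score: sum of min memberships over shared units, scaled by fp_a.

    Assumption: this scaling is a hypothetical choice so that two identical
    fingerprints score approximately 1.0.
    """
    shared = fp_a.keys() & fp_b.keys()
    overlap = sum(min(fp_a[u], fp_b[u]) for u in shared)
    return overlap / sum(fp_a.values())


def identify(utterance: List[float], profiles: Dict[str, List[float]], k: int) -> str:
    """Return the speaker whose fingerprint best matches the utterance."""
    utt_fp = top_k_fingerprint(utterance, k)
    scores = {
        name: similarity(utt_fp, top_k_fingerprint(vec, k))
        for name, vec in profiles.items()
    }
    return max(scores, key=scores.get)


# Made-up per-speaker activation profiles (illustrative values only).
profiles = {
    "ross": [0.9, 0.1, 0.8, 0.05, 0.7, 0.2],
    "joey": [0.1, 0.9, 0.2, 0.8, 0.3, 0.7],
}
utterance = [0.85, 0.2, 0.6, 0.1, 0.75, 0.3]

predicted = identify(utterance, profiles, k=3)
print(predicted)
```

With these made-up vectors, the utterance shares all three of its top units with the `ross` profile and none with `joey`, so the sketch picks `ross`. Lowering `k` shrinks each fingerprint to fewer hidden units, which is the trade-off Figure 3 plots against accuracy.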