The Impact of Automatic Speech Transcription on Speaker Attribution

Cristina Aggazzotti, Matthew Wiesner, Elizabeth Allyn Smith, Nicholas Andrews

TL;DR

This work investigates how transcription errors from automatic speech recognition (ASR) affect speaker attribution from transcripts. It evaluates five text-based attribution models on transcripts from five diverse ASR systems on the Fisher corpus, using concatenated minimum-permutation word error rate (cpWER) to quantify transcription differences and area under the ROC curve (AUC) to measure attribution performance. The main finding is that attribution remains robust to word-level errors and can even improve with ASR-derived transcripts, potentially because transcription mistakes encode speaker-specific signals. Performance does not systematically degrade even at high cpWER, and utterance-length cues can establish a lower bound on accuracy in extreme cases. The results motivate developing speech-aware attribution models and caution against over-reliance on generic text models, while highlighting practical implications and ethical considerations for transcript-based speaker identification in real-world settings.
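As a rough illustration of the two metrics named above, the following sketch computes cpWER for one recording and AUC over a set of verification trials. It is not the paper's evaluation code: the function names, the dictionary input format, whitespace tokenization, and the brute-force search over speaker permutations are all assumptions made for this example.

```python
from itertools import permutations


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]


def cpwer(ref_by_speaker, hyp_by_speaker):
    """Concatenated minimum-permutation WER for one recording.

    Both arguments map a speaker label to that speaker's list of utterance
    strings (an assumed input format). Each speaker's utterances are
    concatenated, then the speaker assignment with the fewest errors is used.
    """
    refs = [" ".join(utts).split() for utts in ref_by_speaker.values()]
    hyps = [" ".join(utts).split() for utts in hyp_by_speaker.values()]
    # Pad with empty streams so both sides have the same number of speakers.
    n = max(len(refs), len(hyps))
    refs += [[] for _ in range(n - len(refs))]
    hyps += [[] for _ in range(n - len(hyps))]
    total_ref_words = sum(len(r) for r in refs)
    if total_ref_words == 0:
        raise ValueError("reference transcript is empty")
    # Brute force over speaker permutations; fine for two-party calls.
    best_errors = min(
        sum(edit_distance(r, h) for r, h in zip(refs, perm))
        for perm in permutations(hyps)
    )
    return best_errors / total_ref_words


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of trial scores.

    labels: 1 for same-speaker (positive) trials, 0 for different-speaker
    (negative) trials. Ties count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

Because cpWER is computed over the best speaker permutation, it does not penalize an ASR system for using different speaker labels than the reference; only the words assigned to each speaker matter.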

Abstract

Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g., deleted) or unreliable (e.g., anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good as, if not better than, attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features revealing of speaker identity.

Paper Structure

This paper contains 24 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Minimum, average, and maximum (each color's height) attribution model AUC performance (y-axis) on 'hard' difficulty level gold standard Fisher test verification trials (0%) and several automatic transcript test trials with various cpWERs (x-axis). Despite significant increases in cpWER, attribution performance surprisingly stays fairly constant.
  • Figure 2: Minimum, average, and maximum (each color's height) attribution model AUC performance (y-axis) on gold standard, German ASR (DEU), R (replacing each token with the same word), R$_{\overline{U}}$ (R $+$ truncating utterances to the mean length for that speaker in that call), R$_{U_{10}}$ (R $+$ truncating all utterances to 10 tokens), and R$_{T_{50},U_{10}}$ (R $+$ truncating transcripts to 50 utterances and utterances to 10 tokens) test verification trials (x-axis). Attribution performance does not drop significantly unless all utterance lengths are equalized.
  • Figure 3: Ratio of unigram/bigram overlap in positive to negative trials (y-axis) across all ASR systems, identified by their cpWER (x-axis). All values are $>1$, indicating more overlap in positive trials, especially for bigrams from German ASR (90% cpWER), which helps explain why attribution performance does not significantly degrade.
  • Figure 4: Ratio of mean utterance length delta in positive to negative trials (y-axis) across all ASR systems depicted by their cpWER (x-axis) in all three difficulties. All values are $<1$, indicating a larger difference in utterance lengths in negative trials, and consistent across difficulty levels, suggesting that utterance length depends more on speaker than on topic.
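
The ablation conditions named in the Figure 2 caption (R, R$_{\overline{U}}$, R$_{U_{10}}$, R$_{T_{50},U_{10}}$) can be sketched as simple transcript transformations. This is an illustrative reading of the caption, not the authors' code: the placeholder token, the variant names, rounding the mean length to an integer, and keeping the first 50 utterances are all assumptions.

```python
from statistics import mean

# Hypothetical filler token; the caption does not say which word was used.
PLACEHOLDER = "word"


def replace_tokens(transcript):
    """R: replace every token with the same word, preserving utterance lengths."""
    return [[PLACEHOLDER] * len(utt) for utt in transcript]


def truncate_utterances(transcript, max_len):
    """Cut each utterance down to at most max_len tokens."""
    return [utt[:max_len] for utt in transcript]


def ablate(transcript, variant):
    """Apply one ablation to a single speaker's side of a single call.

    transcript: non-empty list of utterances, each a list of tokens
                (an assumed format).
    variant:    one of "R", "R_Ubar", "R_U10", "R_T50_U10"
                (names chosen here to mirror the caption).
    """
    replaced = replace_tokens(transcript)
    if variant == "R":
        return replaced
    if variant == "R_Ubar":
        # Truncate to this speaker's mean utterance length in this call.
        # Rounding to an integer is an assumption of this sketch.
        mean_len = round(mean(len(utt) for utt in replaced))
        return truncate_utterances(replaced, mean_len)
    if variant == "R_U10":
        return truncate_utterances(replaced, 10)
    if variant == "R_T50_U10":
        # Assumes the first 50 utterances are the ones kept.
        return truncate_utterances(replaced[:50], 10)
    raise ValueError(f"unknown variant: {variant}")
```

After R alone, the only information left is the number of utterances and their lengths, which is why this family of ablations isolates the utterance-length cue discussed in the captions.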