TOGGL: Transcribing Overlapping Speech with Staggered Labeling

Chak-Fai Li, William Hartmann, Matthew Snover

TL;DR

TOGGL introduces a single-decoder, token-based framework for transcribing overlapping speech by assigning tokens to speakers via [NEXT] and [PREV] switches, enabling scalable multi-speaker transcription without per-speaker decoders. The approach combines mixture-aware pretraining (Cocktail HuBERT) with supervised fine-tuning, using a CTC objective and a time-aligned merging of transcripts interleaved with switching tokens. Experiments on conversational Fisher/Switchboard data show TOGGL outperforms baselines and generalizes to up to four speakers, with the 3-mix variant delivering the strongest performance, particularly under high overlap. The work demonstrates that training on overlapping data can improve single-speaker ASR and highlights the importance of pretraining strategies and token-level serialization for robust multi-speaker transcription.

Abstract

Transcribing the speech of multiple overlapping speakers typically requires separating the audio into multiple streams and recognizing each one independently. More recent work jointly separates and transcribes, but requires a separate decoding component for each speaker. We propose the TOGGL model to simultaneously transcribe the speech of multiple speakers. The TOGGL model uses special output tokens to attribute the speech to each speaker with only a single decoder. Our approach generalizes beyond two speakers, even when trained only on two-speaker data. We demonstrate superior performance compared to competing approaches on a conversational speech dataset. Our approach also improves performance on single-speaker audio.

Paper Structure (13 sections, 2 figures, 3 tables)

Figures (2)

  • Figure 1: Example decoding output including the special [NEXT] and [PREV] tokens. The output can easily be separated into one utterance per speaker.
  • Figure 2: Example of two training utterances being stitched together into a single transcription using the special [NEXT] and [PREV] tokens.
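
The switching scheme described in the two figure captions can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that the text above does not confirm: decoding starts on speaker 0, [NEXT] moves an active-speaker pointer up by one and [PREV] moves it down by one, the pointer is clamped at the ends, and training transcripts are stitched by sorting words on hypothetical word-level start times. The names `split_by_speaker` and `stitch` are invented for this example.

```python
from typing import List, Tuple

NEXT, PREV = "[NEXT]", "[PREV]"


def split_by_speaker(tokens: List[str], num_speakers: int = 2) -> List[List[str]]:
    """Route each word in a decoded sequence to the active speaker's stream.

    Assumption: the pointer starts at speaker 0 and is clamped to the
    valid range [0, num_speakers - 1].
    """
    streams: List[List[str]] = [[] for _ in range(num_speakers)]
    active = 0
    for tok in tokens:
        if tok == NEXT:
            active = min(active + 1, num_speakers - 1)
        elif tok == PREV:
            active = max(active - 1, 0)
        else:
            streams[active].append(tok)
    return streams


def stitch(utterances: List[List[Tuple[float, str]]]) -> List[str]:
    """Merge per-speaker (start_time, word) lists into one token sequence.

    Assumption: words are interleaved by start time, and switch tokens are
    emitted until the pointer reaches the speaker of the next word.
    """
    timed = sorted(
        (start, spk, word)
        for spk, utt in enumerate(utterances)
        for start, word in utt
    )
    out: List[str] = []
    active = 0
    for _, spk, word in timed:
        while active < spk:  # step forward to a later speaker
            out.append(NEXT)
            active += 1
        while active > spk:  # step back to an earlier speaker
            out.append(PREV)
            active -= 1
        out.append(word)
    return out


# Two overlapping utterances with made-up timestamps (seconds).
spk0 = [(0.0, "hello"), (0.4, "how"), (0.9, "are"), (1.2, "you")]
spk1 = [(0.6, "hi"), (1.0, "fine")]

merged = stitch([spk0, spk1])
print(" ".join(merged))
# hello how [NEXT] hi [PREV] are [NEXT] fine [PREV] you

print(split_by_speaker(merged))
# [['hello', 'how', 'are', 'you'], ['hi', 'fine']]
```

Because the tokens are relative moves rather than absolute speaker labels, the same two-token vocabulary can in principle address more than two speakers, which is consistent with the abstract's claim that the approach generalizes beyond two speakers.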