TOGGL: Transcribing Overlapping Speech with Staggered Labeling
Chak-Fai Li, William Hartmann, Matthew Snover
TL;DR
TOGGL introduces a single-decoder, token-based framework for transcribing overlapping speech by assigning tokens to speakers via [NEXT] and [PREV] switches, enabling scalable multi-speaker transcription without per-speaker decoders. The approach combines mixture-aware pretraining (Cocktail HuBERT) with supervised fine-tuning, using a CTC objective and a time-aligned merging of transcripts interleaved with switching tokens. Experiments on conversational Fisher/Switchboard data show TOGGL outperforms baselines and generalizes to up to four speakers, with the 3-mix variant delivering the strongest performance, particularly under high overlap. The work demonstrates that training on overlapping data can improve single-speaker ASR and highlights the importance of pretraining strategies and token-level serialization for robust multi-speaker transcription.
Abstract
Transcribing the speech of multiple overlapping speakers typically requires separating the audio into multiple streams and recognizing each one independently. More recent work jointly separates and transcribes, but requires a separate decoding component for each speaker. We propose the TOGGL model to simultaneously transcribe the speech of multiple speakers. The TOGGL model uses special output tokens to attribute the speech to each speaker with only a single decoder. Our approach generalizes beyond two speakers, even when trained only on two-speaker data. We demonstrate superior performance compared to competing approaches on a conversational speech dataset. Our approach also improves performance on single-speaker audio.
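To make the idea of attributing speech through special output tokens concrete, the following is a minimal sketch of one plausible reading of the [NEXT]/[PREV] serialization mentioned in the TL;DR. It is not the authors' implementation. The function names `serialize` and `deserialize`, the word-level `(start_time, speaker_index, word)` input format, the convention that decoding starts at speaker 0, and the choice to emit several consecutive switch tokens to jump more than one speaker are all assumptions made for illustration.

```python
from typing import List, Tuple

NEXT, PREV = "[NEXT]", "[PREV]"  # switch tokens named in the TL;DR


def serialize(words: List[Tuple[float, int, str]]) -> List[str]:
    """Merge (start_time, speaker_index, word) triples into one token stream.

    Assumption: words are ordered by start time, and a change of speaker is
    encoded by moving a speaker pointer up ([NEXT]) or down ([PREV]).
    """
    tokens: List[str] = []
    current = 0  # assumption: the pointer starts at speaker 0
    for _, speaker, word in sorted(words, key=lambda w: w[0]):
        step = speaker - current
        # Emit one switch token per speaker index crossed (hypothetical rule).
        tokens.extend([NEXT if step > 0 else PREV] * abs(step))
        current = speaker
        tokens.append(word)
    return tokens


def deserialize(tokens: List[str]) -> List[List[str]]:
    """Split a serialized token stream back into per-speaker transcripts."""
    transcripts: List[List[str]] = [[]]
    current = 0
    for tok in tokens:
        if tok == NEXT:
            current += 1
            if current == len(transcripts):
                transcripts.append([])  # first time this speaker is reached
        elif tok == PREV:
            current = max(current - 1, 0)  # never move below speaker 0
        else:
            transcripts[current].append(tok)
    return transcripts


if __name__ == "__main__":
    overlap = [(0.0, 0, "hello"), (0.4, 1, "hi"), (0.5, 0, "there"), (0.9, 1, "bob")]
    stream = serialize(overlap)
    print(stream)
    print(deserialize(stream))
```

Under these assumptions, a single output stream carries every speaker's words, and the number of speakers is bounded only by how far the pointer moves rather than by a fixed set of decoders. That is consistent with the claim that the approach can extend beyond two speakers, though the exact token semantics used in the paper may differ.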
