Scaling Multi-Talker ASR with Speaker-Agnostic Activity Streams

Xiluo He; Alexander Polok; Jesús Villalba; Thomas Thebaud; Matthew Maciejewski

TL;DR

The paper tackles the high inference cost of multi-talker ASR with Heuristic Error Assignment Training (HEAT), which decouples runtime from the number of active speakers. HEAT merges per-speaker activities into two speaker-agnostic streams and conditions a target-speaker ASR backbone (DiCoW) on stream-specific activity cues through its Frame-Level Diarization-Dependent Transformation. In experiments with both oracle and diarization-based activity on AMI, ICSI, and SparseLibriMix, the approach achieves substantial runtime gains (higher RTFx) while keeping time-constrained WER (tcORC-WER) competitive. The work demonstrates the practical viability of two-stream conditioning and lays groundwork for end-to-end HEAT outputs and streaming backbones. The code is described as release-ready for reproducibility.

Abstract

An increasingly common training paradigm for multi-talker automatic speech recognition (ASR) is to use speaker activity signals to adapt single-speaker ASR models for overlapping speech. Although effective, these systems require running the ASR model once per speaker, resulting in inference costs that scale with the number of speakers and limiting their practicality. In this work, we propose a method that decouples the inference cost of activity-conditioned ASR systems from the number of speakers by converting speaker-specific activity outputs into two speaker-agnostic streams. A central challenge is that naïvely merging speaker activities into streams significantly degrades recognition, since pretrained ASR models assume contiguous, single-speaker inputs. To address this, we design new heuristics aimed at preserving conversational continuity and maintaining compatibility with existing systems. We show that our approach is compatible with Diarization-Conditioned Whisper (DiCoW) to greatly reduce runtimes on the AMI and ICSI meeting datasets while retaining competitive performance.
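To make the idea of merging speaker activities into two streams concrete, the following is a minimal sketch and not the paper's HEAT heuristics, which the abstract does not specify. It assumes diarization output is available as (start, end, speaker) tuples. The function name `assign_to_two_streams`, the greedy placement rule, and the tie-break that prefers the stream where the same speaker last spoke are all hypothetical choices made for illustration.

```python
from typing import List, Optional, Tuple

# Assumed input format: (start_sec, end_sec, speaker_label)
Segment = Tuple[float, float, str]


def assign_to_two_streams(
    segments: List[Segment],
) -> Tuple[List[List[Segment]], List[Segment]]:
    """Greedily place segments on two streams so that no stream overlaps itself.

    Illustrative only: this is NOT the HEAT procedure from the paper.
    Segments that overlap both busy streams (3+ simultaneous talkers)
    are returned separately as `dropped`.
    """
    streams: List[List[Segment]] = [[], []]
    free_at = [0.0, 0.0]  # time at which each stream becomes free
    last_speaker: List[Optional[str]] = [None, None]
    dropped: List[Segment] = []

    for seg in sorted(segments):  # process in order of start time
        start, end, spk = seg
        candidates = [i for i in (0, 1) if free_at[i] <= start]
        if not candidates:
            dropped.append(seg)
            continue
        # Hypothetical continuity rule: prefer the stream whose most
        # recent segment came from the same speaker, then the lower index.
        candidates.sort(key=lambda i: (last_speaker[i] != spk, i))
        chosen = candidates[0]
        streams[chosen].append(seg)
        free_at[chosen] = end
        last_speaker[chosen] = spk

    return streams, dropped


if __name__ == "__main__":
    demo = [(0.0, 2.0, "A"), (1.0, 3.0, "B"), (2.5, 4.0, "A"), (3.0, 5.0, "C")]
    streams, dropped = assign_to_two_streams(demo)
    print("stream 0:", streams[0])
    print("stream 1:", streams[1])
    print("dropped:", dropped)
```

With any such scheme the ASR backbone runs twice per recording, once per stream, regardless of how many speakers are present. The sketch also shows why a naive merge can hurt: a single stream may interleave different speakers, which conflicts with the contiguous, single-speaker input that pretrained ASR models assume.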
