Who Spoke What When? Evaluating Spoken Language Models for Conversational ASR with Semantic and Overlap-Aware Metrics

Naohiro Tawara; Samuele Cornell; Alexander Polok; Marc Delcroix; Lukáš Burget; Shinji Watanabe

Abstract

Conversational automatic speech recognition remains challenging due to overlapping speech, far-field noise, and varying speaker counts. While recent LLM-based systems perform well on single-speaker benchmarks, their robustness in multi-speaker settings is unclear. We systematically compare LLM-based and modular pipeline approaches along four axes: overlap robustness, semantic fidelity, speaker count, and single- versus multi-channel input. To capture meaning-altering errors that conventional metrics miss, we introduce tcpSemER, which extends tcpWER by replacing Levenshtein distance with embedding-based semantic similarity. We further decompose tcpWER into overlapping and non-overlapping components for finer-grained analysis. Experiments across three datasets show that LLM-based systems are competitive in two-speaker settings but degrade as speaker count and overlap increase, whereas modular pipelines remain more robust.
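To illustrate the motivation for a semantic metric, the following minimal sketch computes a plain word error rate from a Levenshtein alignment and splits it into substitutions, deletions, and insertions. It is not the paper's tcpWER or tcpSemER: it ignores speaker attribution, time constraints, and text normalization, and the function names `edit_counts` and `wer` as well as the example sentences are hypothetical, chosen only for illustration.

```python
def edit_counts(ref: str, hyp: str) -> tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) of a minimum-cost word alignment."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = (cost, subs, dels, ins) for aligning r[:i] with h[:j]
    dp = [[(0, 0, 0, 0)] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(1, len(r) + 1):
        dp[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, len(h) + 1):
        dp[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            c, s, d, n = dp[i - 1][j - 1]
            diag = (c, s, d, n) if r[i - 1] == h[j - 1] else (c + 1, s + 1, d, n)
            c, s, d, n = dp[i - 1][j]
            delete = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insert = (c + 1, s, d, n + 1)
            # Pick the cheapest path; ties are broken in favor of the diagonal.
            dp[i][j] = min(diag, delete, insert, key=lambda t: t[0])
    _, s, d, n = dp[-1][-1]
    return s, d, n


def wer(ref: str, hyp: str) -> float:
    """Word error rate: (S + D + I) divided by the number of reference words."""
    s, d, n = edit_counts(ref, hyp)
    return (s + d + n) / max(len(ref.split()), 1)


# Dropping "not" flips the meaning but costs only one deletion.
meaning_flip = wer("i can not come today", "i can come today")
# A spelling variant keeps the meaning but costs one substitution.
harmless = wer("okay see you tomorrow", "ok see you tomorrow")
print(meaning_flip, harmless)
```

In this toy case the meaning-altering deletion scores 0.2 while the harmless spelling variant scores 0.25, which is the kind of mismatch between edit distance and meaning that a semantic metric is intended to address.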

Paper Structure (13 sections, 3 equations, 2 figures, 3 tables)

Introduction
Experimental Protocol
Systems Under Evaluation
Datasets
Evaluation Metrics
Overlap-Aware cpWER and tcpWER
tcpSemER: Semantic Error Rate for Long-Form Multi-Talker Audio
Experimental Analysis
Overall Results
Effect of Overlap
Impact of Number of Speakers and Overlap Ratio
Concluding Remarks
Generative AI Use Disclosure

Figures (2)

Figure 1: Relative change in tcpWER and tcpSemER when switching from CHiME-8 to CHiME-7 text normalization, averaged across the top-4 CHiME-8 DASR challenge systems per dataset. Error bars denote standard deviation.
Figure 2: Decomposition of tcpWER into deletion, insertion, and substitution errors for VibeVoice (Vib), DiCoW (DiC), and CH8 DASR NTT (NTT) on NSF1, shown by speaker count. The proportion of overlapped speech (ovl) is shown in parentheses.
