Exploring Spoken Language Identification Strategies for Automatic Transcription of Multilingual Broadcast and Institutional Speech

Martina Valente; Fabio Brugnara; Giovanni Morrone; Enrico Zovato; Leonardo Badino

TL;DR

The paper tackles spoken language identification (SLI) in authentic multilingual broadcast and institutional speech, where language changes tend to coincide with speaker changes. It proposes a speaker-informed cascade that combines speaker diarization (SD) with segment-based SLI, and compares it with speaker-agnostic baselines, including a frame-based SLI variant for language diarization. On broadcast and institutional data, the SD+SLI cascade often achieves lower language diarization error rates and better multilingual transcription accuracy, with more than 8% relative WER reduction, while largely preserving monolingual ASR performance. The results support using a speaker-change-aware front-end for multilingual ASR in real-world settings, and indicate when frame-based SLI and when segment-based SLI is the better choice.

Abstract

This paper addresses spoken language identification (SLI) and speech recognition of multilingual broadcast and institutional speech, real application scenarios that have rarely been addressed in the SLI literature. Observing that in these domains language changes are mostly associated with speaker changes, we propose a cascaded system consisting of speaker diarization followed by language identification, and compare it with more traditional language identification and language diarization systems. Results show that the proposed system often achieves lower language classification and language diarization error rates (up to 10% relative language diarization error reduction and 60% relative language confusion reduction) and leads to lower WERs on multilingual test sets (more than 8% relative WER reduction), without appreciably degrading speech recognition on monolingual audio (an absolute WER increase between 0.1% and 0.7% w.r.t. monolingual ASR).
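
The cascade described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Segment` and `LanguageTurn` types, the `cascade_sd_sli` function, and the `identify_language` callback are hypothetical names introduced here, and merging adjacent segments that share a language is an assumption about how language turns might be formed. The sketch only shows the general idea of classifying speaker-homogeneous segments rather than fixed-length windows.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Segment:
    """A speaker-homogeneous stretch of audio, as a diarizer might output it."""
    start: float   # seconds
    end: float     # seconds
    speaker: str   # speaker label (hypothetical diarization output)


@dataclass
class LanguageTurn:
    """A contiguous stretch of audio assigned to a single language."""
    start: float
    end: float
    language: str


def cascade_sd_sli(
    segments: List[Segment],
    identify_language: Callable[[Segment], str],
) -> List[LanguageTurn]:
    """Label each diarized segment with a language, then merge neighbours.

    `identify_language` stands in for a segment-based SLI model; it is an
    assumption of this sketch, not an API from the paper.
    """
    turns: List[LanguageTurn] = []
    for seg in sorted(segments, key=lambda s: s.start):
        lang = identify_language(seg)
        if turns and turns[-1].language == lang:
            # Same language as the previous turn: extend it.
            turns[-1].end = max(turns[-1].end, seg.end)
        else:
            # Language change (here it can only occur at a segment boundary).
            turns.append(LanguageTurn(seg.start, seg.end, lang))
    return turns


# Toy usage: the "SLI model" is a lookup by speaker, purely for illustration.
toy_language = {"spk_A": "it", "spk_B": "en"}
diarized = [
    Segment(0.0, 5.0, "spk_A"),
    Segment(5.0, 9.0, "spk_A"),
    Segment(9.0, 14.0, "spk_B"),
]
turns = cascade_sd_sli(diarized, lambda seg: toy_language[seg.speaker])
for turn in turns:
    print(turn)
```

In this sketch a language boundary can only fall on a diarization boundary, which reflects the observation that language changes mostly accompany speaker changes; a frame-based system would instead be free to place boundaries anywhere.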
