SAGE-LD: Towards Scalable and Generalizable End-to-End Language Diarization via Simulated Data Augmentation

Sangmin Lee, Woongjib Choi, Jihyun Kim, Hong-Goo Kang

TL;DR

SAGE-LD tackles language diarization (LD) under unconstrained multilingual conditions by combining a learnable language-query decoder, multilingual features, and large-scale pretraining on simulated code-switching data. The model has three components (feature extractor, contextual encoder, masked decoder). Its data augmentation builds simulated utterances that decouple language shifts from speaker shifts, and training follows a language-aware two-stage regime. Empirical results show state-of-the-art performance across multiple benchmarks covering both long-form and short-form speech, with further gains attributed to simulated pretraining and a specialized loss design. The work offers a scalable, generalizable framework that advances code-switching speech technologies and broadens language coverage for LD systems.
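To make the idea of simulated code-switching data concrete, the following is a minimal sketch, not the authors' pipeline. It assumes that a simulated utterance is built by concatenating monolingual segments in random order and that supervision is a per-frame language label. The function name `simulate_code_switched`, the `frame_hop` parameter, and the labelling rule (label of the first sample in each frame) are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_code_switched(segments, frame_hop=320, rng=None):
    """Concatenate monolingual segments into one utterance with frame labels.

    segments:  list of (waveform, language) pairs, waveform a 1-D array.
    frame_hop: samples per label frame (hypothetical value, not from the paper).
    Returns (mixture, frame_labels, lang_to_id).
    """
    if not segments:
        raise ValueError("need at least one segment")
    rng = rng if rng is not None else np.random.default_rng()

    # Stable integer id per language, e.g. {"en": 0, "ko": 1}.
    lang_to_id = {
        lang: i for i, lang in enumerate(sorted({lang for _, lang in segments}))
    }

    # Shuffle segment order so language switch points vary between samples.
    order = rng.permutation(len(segments))
    waves, sample_labels = [], []
    for idx in order:
        wav, lang = segments[idx]
        wav = np.asarray(wav, dtype=np.float32)
        waves.append(wav)
        sample_labels.append(np.full(len(wav), lang_to_id[lang], dtype=np.int64))

    mixture = np.concatenate(waves)
    # Assumption: a frame takes the label of its first sample.
    frame_labels = np.concatenate(sample_labels)[::frame_hop]
    return mixture, frame_labels, lang_to_id
```

In this sketch, language boundaries are determined only by the segment order. Drawing segments from the same speaker or from different speakers would be one way to vary speaker changes independently of language changes, which is the decoupling the summary refers to.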

Abstract

In this paper, we present a neural spoken language diarization model that supports an unconstrained span of languages within a single framework. Our approach integrates a learnable query-based architecture grounded in multilingual awareness, with large-scale pretraining on simulated code-switching data. By jointly leveraging these two components, our method overcomes the limitations of conventional approaches in data scarcity and architecture optimization, and generalizes effectively to real-world multilingual settings across diverse environments. Experimental results demonstrate that our approach achieves state-of-the-art performance on several language diarization benchmarks, with a relative performance improvement of 23% to 52% over previous methods. We believe that this work not only advances research in language diarization but also establishes a foundational framework for code-switching speech technologies.
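For readers unfamiliar with how a relative improvement is usually computed, here is a small sketch. It assumes the underlying metric is an error rate where lower is better. The abstract does not name the metric, so that is an assumption, and the numbers in the usage line are placeholders rather than results from the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Fraction by which new_error reduces baseline_error (lower is better)."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - new_error) / baseline_error


# Placeholder values only: an error dropping from 20.0 to 15.0
# corresponds to a 25% relative improvement.
print(f"{relative_improvement(20.0, 15.0):.0%}")
```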

Paper Structure

This paper contains 12 sections, 4 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Architecture of SAGE-LD with $n=6$. The model comprises three modules: a feature extractor, a contextual encoder, and a decoder with learnable language queries.
  • Figure 2: Architecture of the decoder with $n=4$. Each query $q_i$, mask $m_i$, and activity $c_i$ is iteratively refined, and the classification head sorts active queries to generate a prediction $m_i^\theta$.
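
One plausible way to read the Figure 2 caption is that each query carries a frame-level mask and an activity score, and that only active queries contribute to the final prediction. The sketch below illustrates that reading only; it is not the authors' decoding procedure. It thresholds the activity scores and then assigns each frame to the active query with the highest mask score. The function name `frames_from_queries`, the 0.5 threshold, and the `query_lang` array mapping queries to language ids are hypothetical.

```python
import numpy as np


def frames_from_queries(mask_logits, activity, query_lang, threshold=0.5):
    """Turn per-query masks and activities into per-frame language ids.

    mask_logits: (n_queries, n_frames) mask scores, one row per query.
    activity:    (n_queries,) activity scores in [0, 1].
    query_lang:  (n_queries,) language id associated with each query.
    Returns an (n_frames,) array; -1 everywhere if no query is active.
    """
    mask_logits = np.asarray(mask_logits, dtype=float)
    activity = np.asarray(activity, dtype=float)
    query_lang = np.asarray(query_lang)

    active = activity > threshold
    n_frames = mask_logits.shape[1]
    if not active.any():
        return np.full(n_frames, -1)

    # Inactive queries are excluded by giving them a score of -inf.
    scores = np.where(active[:, None], mask_logits, -np.inf)
    winner = scores.argmax(axis=0)  # best active query for each frame
    return query_lang[winner]
```

With three queries where the third has low activity, its mask is ignored even if its scores are the largest, so the output depends only on the two active queries.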