Speaker-Aware Simulation Improves Conversational Speech Recognition

Máté Gedeon, Péter Mihajlik

TL;DR

This work addresses the scarcity of large-scale Hungarian conversational data for ASR by adapting Speaker-Aware Simulated Conversations (SASC) and introducing a duration-conditioned variant (C-SASC) to model pauses and overlap using $\delta$ and $d_n$. Synthetic dialogues are generated from the BEA-Large corpus using conversational statistics derived from CallHome, BEA-Dialogue, and GRASS, then combined with BEA-Dialogue data for training; evaluation is carried out on real Hungarian conversations. Experiments show that speaker-aware simulation improves transcription performance over naive concatenation. C-SASC offers systematic but modest gains in character-level metrics when the source statistics align with the target domain, so dependence on domain-matched statistics is a key takeaway. The study demonstrates the practicality of temporally aware synthetic dialogue generation for low-resource languages and provides guidance on when duration-conditioned modeling yields benefits and how dataset size and RIR augmentation influence results.

Abstract

Automatic speech recognition (ASR) for conversational speech remains challenging due to the limited availability of large-scale, well-annotated multi-speaker dialogue data and the complex temporal dynamics of natural interactions. Speaker-aware simulated conversations (SASC) offer an effective data augmentation strategy by transforming single-speaker recordings into realistic multi-speaker dialogues. However, prior work has primarily focused on English data, leaving questions about the applicability to lower-resource languages. In this paper, we adapt and implement the SASC framework for Hungarian conversational ASR. We further propose C-SASC, an extended variant that incorporates pause modeling conditioned on utterance duration, enabling a more faithful representation of local temporal dependencies observed in human conversation while retaining the simplicity and efficiency of the original approach. We generate synthetic Hungarian dialogues from the BEA-Large corpus and combine them with real conversational data for ASR training. Both SASC and C-SASC are evaluated extensively under a wide range of simulation configurations, using conversational statistics derived from the CallHome, BEA-Dialogue, and GRASS corpora. Experimental results show that speaker-aware conversational simulation consistently improves recognition performance over naive concatenation-based augmentation. While the additional duration conditioning in C-SASC yields modest but systematic gains, most notably in character-level error rates, its effectiveness depends on the match between source conversational statistics and the target domain. Overall, our findings confirm the robustness of speaker-aware conversational simulation for Hungarian ASR and highlight the benefits and limitations of increasingly detailed temporal modeling in synthetic dialogue generation.
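To make the idea of duration-conditioned pause modeling concrete, the following is a minimal sketch and not the authors' implementation. It assumes that gaps between consecutive utterances (negative values meaning overlap) are resampled from an empirical table bucketed by the duration of the preceding utterance. The function names, the single 2-second bin edge, and the toy statistics are all hypothetical, and the speaker-aware conditioning (same versus different speaker) is omitted for brevity.

```python
import bisect
import random


def fit_conditional_gaps(durations, gaps, bin_edges):
    """Group observed gaps by the duration bin of the preceding utterance."""
    table = [[] for _ in range(len(bin_edges) + 1)]
    for d, g in zip(durations, gaps):
        table[bisect.bisect_right(bin_edges, d)].append(g)
    return table


def sample_gap(table, bin_edges, d_n, rng):
    """Draw a gap (seconds) for an utterance of duration d_n (seconds)."""
    bucket = table[bisect.bisect_right(bin_edges, d_n)]
    if not bucket:
        # Fall back to the pooled, unconditioned gaps if the bin is empty.
        bucket = [g for b in table for g in b]
    return rng.choice(bucket)


def place_utterances(durations, table, bin_edges, rng):
    """Return (start, end) times; a negative gap produces overlap."""
    timeline, cursor = [], 0.0
    for d in durations:
        start = max(0.0, cursor)
        end = start + d
        timeline.append((start, end))
        cursor = end + sample_gap(table, bin_edges, d, rng)
    return timeline


# Toy statistics (hypothetical): short utterances tend to be overlapped,
# long utterances tend to be followed by a pause.
bin_edges = [2.0]
table = fit_conditional_gaps(
    durations=[0.5, 0.8, 3.0, 4.2],
    gaps=[-0.2, -0.1, 0.4, 0.6],
    bin_edges=bin_edges,
)
timeline = place_utterances([1.0, 3.0, 0.5], table, bin_edges, random.Random(0))
```

With these toy statistics, the second utterance starts before the first one ends (overlap after a short utterance), while the third starts only after a pause following the long second utterance. Dropping the duration bins, that is, using an empty `bin_edges` list, reduces the sketch to unconditioned gap sampling.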

Paper Structure

This paper contains 13 sections, 7 equations, 5 figures, 3 tables, 2 algorithms.

Figures (5)

  • Figure 1: Modeled pause distributions for consecutive utterances under different speaker conditions.
  • Figure 2: Utterance duration distributions across corpora.
  • Figure 3: Speaker transition model estimated from the CallHome corpus.
  • Figure 4: Ratio of the BEA-Dialogue training set to the whole training set with increasing simulation size.
  • Figure 5: Effect of simulated dataset size on cpWER and cpCER.