Spoken language change detection inspired by speaker change detection

Jagabandhu Mishra; S. R. Mahadeva Prasanna

TL;DR

This work investigates spoken language change detection (LCD) by borrowing architectures from speaker change detection (SCD), comparing unsupervised distance-based methods with model-based approaches that leverage language priors via i-vectors and x-vectors. A human study reveals LCD requires longer contextual information and benefits from prior language exposure, motivating longer analysis windows and language-aware modeling. The unsupervised LCD framework increases window length to improve change-point evidence, achieving notable gains on synthetic data (≈29.1% relative) and modest gains on real data (≈2.4% relative). Model-based LCD with language embeddings (x-vectors) yields substantial improvements over unsupervised results on synthetic data (≈31.63% relative) and meaningful, though dataset-dependent, gains on MSCSTB; the findings underscore the value of language priors while highlighting challenges posed by short monolingual segments in practical corpora.

Abstract

Spoken language change detection (LCD) refers to identifying the language transitions in a code-switched utterance. Similarly, identifying the speaker transitions in a multispeaker utterance is known as speaker change detection (SCD). Since the two tasks are similar, the architectures and frameworks developed for SCD may also suit LCD. Hence, the aim of the present work is to develop LCD systems inspired by SCD. As a first step, human subjects perform both LCD and SCD. The study suggests that, compared to SCD, humans performing LCD require (a) a longer duration around the change point and (b) prior exposure to the specific languages. The longer-duration requirement is incorporated by increasing the analysis window length of the unsupervised distance-based approach, which yields relative performance improvements of 29.1% and 2.4% on the synthetic and practical code-switched datasets, respectively. Incorporating a priori language knowledge provides relative improvements of 31.63% and 14.27% on the same two datasets, respectively. The performance difference between the practical and synthetic datasets is mostly due to differences in the distribution of the monolingual segment durations.
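To make the idea of an unsupervised distance-based detector with an adjustable analysis window concrete, the following is a minimal sketch, not the authors' implementation. It assumes frame-level features (for example MFCCs) are already available as a NumPy array. The choice of a symmetric KL divergence between diagonal Gaussians, the peak-picking rule, and all function and parameter names (`symmetric_kl_diag`, `distance_contour`, `pick_change_points`, `win`, `threshold`) are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def symmetric_kl_diag(x, y, eps=1e-6):
    """Symmetric KL divergence between diagonal Gaussians fit to two frame blocks.

    x, y: arrays of shape (frames, dims). The distance measure is an assumption
    made for illustration; the paper may use a different one.
    """
    mx, my = x.mean(axis=0), y.mean(axis=0)
    vx, vy = x.var(axis=0) + eps, y.var(axis=0) + eps
    d2 = (mx - my) ** 2
    # KL(p||q) + KL(q||p) for diagonal Gaussians; the log-variance terms cancel.
    return 0.5 * np.sum(vx / vy + vy / vx - 2.0 + d2 * (1.0 / vx + 1.0 / vy))


def distance_contour(features, win):
    """Distance between the `win` frames before and after each frame index.

    A larger `win` gives each side more context, which mirrors the longer
    analysis window discussed in the abstract.
    """
    n = len(features)
    contour = np.zeros(n)
    for t in range(win, n - win + 1):
        contour[t] = symmetric_kl_diag(features[t - win:t], features[t:t + win])
    return contour


def pick_change_points(contour, win, threshold):
    """Keep local maxima above `threshold`, at least `win` frames apart."""
    candidates = [
        t for t in range(1, len(contour) - 1)
        if contour[t] >= threshold
        and contour[t] >= contour[t - 1]
        and contour[t] >= contour[t + 1]
    ]
    picked = []
    # Greedy selection: strongest peaks first, suppressing close neighbours.
    for t in sorted(candidates, key=lambda i: -contour[i]):
        if all(abs(t - p) >= win for p in picked):
            picked.append(t)
    return sorted(picked)
```

A typical call would compute `contour = distance_contour(features, win=50)` and then `pick_change_points(contour, win=50, threshold=10.0)`; both the window length and the threshold would need tuning on held-out data.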

Paper Structure (11 sections, 5 equations, 12 figures, 4 tables)

Introduction
Database setup
Human subjective study for language and speaker change detection
LCD and SCD using unsupervised distance-based approach
Language Change Detection by Model-based Approach
Model-based change detection framework
Experimental Setup
Language discrimination by statistical/embedding vectors
Experimental Results
Discussion
Conclusion

Figures (12)

Figure 1: (a) and (c) Time-domain speech signal of a two-speaker utterance and its spectrogram, respectively. (b) and (d) Time-domain speech signal of a two-language (bilingual) utterance and its spectrogram, respectively.
Figure 2: (a) $DER$ distributions of the subjects, and (b) F-statistic (F Stat) values of the ANOVA test between the $DER$ distributions of the LCD (L) and SCD (S) studies.
Figure 3: Median values of the (a) $NR$ and (b) $RT$ distributions for the LCD and SCD.
Figure 4: DER vs. language comfortability score (LCS) for LCD with NVF-50 and NVF-75.
Figure 5: Basic block diagram of the change detection framework for the unsupervised distance-based approach.
...and 7 more figures
