Implicit Self-supervised Language Representation for Spoken Language Diarization

Jagabandhu Mishra; S. R. Mahadeva Prasanna

TL;DR

The paper investigates implicit language representations for spoken language diarization (LD) in code-switched (CS) data. It proposes three LD frameworks (fixed segmentation, change point-based segmentation, and end-to-end, E2E) and compares them with explicit LD. On the synthetic TTSF-LD dataset, x-vector based implicit representations can match explicit LD, and the E2E implicit framework performs best ($DER=5.81$, $JER=6.38$). To address the challenges of practical CS data, the study introduces self-supervised wav2vec (W2V) representations. These surpass x-vector performance in both the fixed and change-point setups and yield strong E2E results ($DER \approx 11.2$, $JER \approx 21.8$) with attention pooling. On the MSCS corpus, W2V-based methods mitigate primary-language bias and improve LD, though challenges persist because of imbalanced data and short secondary-language segments; future work includes domain-adaptive and generative regularization approaches to further close the gap. Overall, the work highlights the potential of self-supervised implicit representations for robust LD in CS scenarios, especially for low-resource languages, and provides a systematic framework for evaluating them across multiple diarization paradigms.

Abstract

In a code-switched (CS) scenario, spoken language diarization (LD) is essential as a pre-processing system. Implicit frameworks are preferable to explicit ones because they can be adapted easily to low- and zero-resource languages. Inspired by the speaker diarization (SD) literature, three frameworks are proposed to perform LD, based on (1) fixed segmentation, (2) change point-based segmentation and (3) end-to-end (E2E) modeling. An initial exploration with the synthetic TTSF-LD dataset shows that using the x-vector as an implicit language representation, with an appropriate analysis window length ($N$), can achieve performance on par with explicit LD. The best implicit LD performance, a Jaccard error rate (JER) of $6.38$, is achieved with the E2E framework. However, with the same E2E framework, the performance of implicit LD degrades to a JER of $60.4$ on the practical Microsoft CS (MSCS) dataset. The difference in performance is mostly due to the difference between the MSCS and TTSF-LD datasets in the distribution of monolingual segment durations of the secondary language. Moreover, to avoid segment smoothing, the shorter monolingual segments call for a small value of $N$. At the same time, with a small $N$ the x-vector representation cannot capture the required language discrimination because of acoustic similarity, as the same speaker speaks both languages. To resolve this issue, a self-supervised implicit language representation is proposed in this study. Compared with the x-vector representation, the proposed representation provides a relative improvement of $63.9\%$ and achieves a JER of $21.8$ using the E2E framework.
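The relative improvement quoted in the abstract can be checked from the two E2E JER values it reports. The sketch below is only an arithmetic check; the helper name `relative_improvement` is hypothetical, and it assumes the improvement is the reduction in JER divided by the x-vector JER.

```python
def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Relative reduction of an error rate, expressed in percent."""
    return 100.0 * (baseline_error - new_error) / baseline_error


# E2E JER values quoted in the abstract:
# x-vector representation vs. the proposed self-supervised representation.
jer_xvector = 60.4
jer_self_supervised = 21.8

improvement = relative_improvement(jer_xvector, jer_self_supervised)
print(f"Relative JER improvement: {improvement:.1f}%")
```

Under that assumption, the computed value agrees with the $63.9\%$ stated in the abstract.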

Paper Structure (24 sections, 9 equations, 16 figures, 14 tables)

Introduction
Database details
Spoken Language Diarization with TTSF-LD dataset
Diarization with Implicit x-vector Representation
Training of x-vector architecture
x-vector representation
Diarization with fixed segmentation
Diarization with change point based segmentation
End-to-end diarization
Explicit Spoken Language Diarization
Explicit Language Representation
Performances of Explicit LD
Diarization with practical CS utterances
Implicit LD with Fixed and change point inspired segmentation
End-to-End Implicit LD
...and 9 more sections

Figures (16)

Figure 1: (a) Time-domain representation of a code-switched speech utterance, (b) spectrogram, and t-SNE distributions of (c) the MFCC features, (d) the W2V-based ASR posteriors and (e) the x-vector representations, respectively.
Figure 2: Block diagram of implicit diarization framework. VAD: voice activity detection, IR: implicit representation, and DC: deep clustering.
Figure 3: Block diagram of x-vector architecture. FV: MFCC feature vector, B: batch size, N: analysis window length, and $x_{a}$/$x_{b}$: x-vector.
Figure 4: GPLDA score distribution of the x-vector representation between the trials of (a) WS and BS with $N=50$ (EER$=0.001$), (b) WL and BL with $N=50$ (EER$=17$) and (c) with $N=200$ (EER$=3.6$), respectively.
Figure 5: Diarization framework with fixed segmentation, VAD: voice activity detection, IR: implicit representation, PM: projection matrix, AHC: agglomerative hierarchical clustering.
...and 11 more figures
