Language Modelling for Speaker Diarization in Telephonic Interviews

Miquel India, Javier Hernando, José A. R. Fonollosa

TL;DR

This work tackles speaker diarization in telephonic two-speaker conversations by fusing language and acoustic information. The authors propose an iterative architecture with character-level word embeddings from a CharCNN-fed LSTM classifier, augmented by GMM-based acoustic scores that are reintroduced into the network across iterations. The results show substantial improvements over an acoustic-only baseline, with word-level diarization error rates around a few percent in the iterative, multimodal setting, and notable gains particularly for short speaker turns. The study demonstrates the practical viability of integrating ASR-derived linguistic cues with traditional acoustic modelling in call-center contexts and offers insights into how language cues can compensate when acoustic cues are weak. This approach advances multimodal diarization and informs future directions for end-to-end ASR and diarization integration in real-world applications.

Abstract

The aim of this paper is to investigate the benefit of combining language and acoustic modelling for speaker diarization. Although conventional systems use only acoustic features, in some scenarios linguistic data contain highly discriminative speaker information, even more reliable than the acoustic information. In this study we analyze how an appropriate fusion of both kinds of features can obtain good results in these cases. The proposed system is based on an iterative algorithm in which an LSTM network is used as a speaker classifier. The network is fed with character-level word embeddings and a GMM-based acoustic score created with the output labels from previous iterations. The presented algorithm has been evaluated on a Call-Center database composed of telephone interview recordings. The combination of acoustic features and linguistic content shows an 84.29% improvement in word-level DER compared to an HMM/VB baseline system. The results of this study confirm that linguistic content can be used efficiently for some speaker recognition tasks.
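To make the evaluation metric concrete, the following is a minimal sketch of how a word-level DER and a relative improvement could be computed. It assumes, since the abstract does not spell out the definition, that word-level DER is the fraction of words assigned to the wrong speaker under the best mapping between hypothesis and reference labels, and that the reported improvement is relative to the baseline error. The function names `wder` and `relative_improvement` and all label values are hypothetical and are not taken from the paper.

```python
from itertools import permutations


def wder(reference, hypothesis):
    """Fraction of words with a wrong speaker label, under the best label mapping.

    Assumed definition (not confirmed by the source): one speaker label per word,
    and hypothesis labels are arbitrary cluster ids that must be mapped to
    reference speakers before counting errors.
    """
    if len(reference) != len(hypothesis) or not reference:
        raise ValueError("need two non-empty label sequences of equal length")
    ref_ids = sorted(set(reference))
    hyp_ids = sorted(set(hypothesis))
    # Pad the reference side so every hypothesis label can be mapped somewhere.
    ref_ids += [None] * max(0, len(hyp_ids) - len(ref_ids))
    best_errors = len(reference)
    # Brute force over mappings; fine for the two-speaker case discussed here.
    for perm in permutations(ref_ids, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, perm))
        errors = sum(mapping[h] != r for r, h in zip(reference, hypothesis))
        best_errors = min(best_errors, errors)
    return best_errors / len(reference)


def relative_improvement(baseline_error, system_error):
    """Relative error reduction over the baseline, in percent."""
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - system_error) / baseline_error


# Hypothetical five-word conversation; labels are illustrative only.
reference = ["agent", "agent", "client", "client", "agent"]
hypothesis = ["B", "B", "A", "A", "A"]
example_wder = wder(reference, hypothesis)  # one of five words is wrong
```

Under these assumptions, a figure such as the 84.29% quoted above would correspond to `relative_improvement(baseline_wder, fused_wder)`, where both arguments are word-level error rates measured on the same test set.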

Paper Structure

This paper contains 15 sections, 9 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: System Diagram.
  • Figure 2: Network Architecture Scheme. In Neural Network 1, the words are the only input. In Neural Network 2, acoustic speaker scores are additionally concatenated with the word embeddings fed to the LSTM.
  • Figure 3: ASR Ground-Truth Labelling. Boxes represent word segments with their respective speaker labels. Arrows indicate the assignment of transcription word labels to the ASR words. C.1, C.2 and C.3 correspond to conditions 1, 2 and 3 of the direct overlapping criterion.
  • Figure 4: Turn duration distribution.
  • Figure 5: WDER of the iterative algorithm as a function of the number of iterations run in the system. The results shown correspond to the ASR condition. Iteration 0 corresponds to the initial speaker labels produced by Neural Network 1.
  • ...and 2 more figures