Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content

Rémi Uro, Marie Tahon, David Doukhan, Antoine Laurent, Albert Rilliard

TL;DR

The paper addresses automatic detection of Terminal turns, or Transition Relevance Places (TRPs), in spontaneous French broadcast dialogues to enable large-scale turn-taking analysis. It employs multimodal models that fuse audio (wav2vec2-base) and text (FlauBERT) representations, comparing Audio Only, Text Only, and fusion strategies (Early, Late, Average) on a French broadcast corpus. The study reports mean accuracies above 0.85 across configurations, with Audio Only and Early Fusion delivering the strongest performance, and shows that fully automatic 3-second window inputs still achieve over 90% accuracy, demonstrating robustness to preprocessing. By releasing code and the dataset, the work supports reproducible, scalable analysis of turn-taking in media and potential applications in human–machine interaction and sociolinguistic research.
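The three fusion strategies named above can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the logistic combination used for late fusion, its weights, and the 0.5 decision threshold are all assumptions made for the example.

```python
import math


def early_fusion(audio_emb, text_emb):
    """Early fusion: concatenate both embeddings so one classifier sees both."""
    return list(audio_emb) + list(text_emb)


def average_fusion(p_audio, p_text):
    """Average fusion: mean of the Terminal probabilities of each modality."""
    return (p_audio + p_text) / 2.0


def late_fusion(p_audio, p_text, w_audio, w_text, bias):
    """Late fusion: combine per-modality outputs with a small learned layer.

    The logistic form and the weights are hypothetical; in practice they
    would be trained on the outputs of the single-modality models.
    """
    z = w_audio * p_audio + w_text * p_text + bias
    return 1.0 / (1.0 + math.exp(-z))


def label(p_terminal, threshold=0.5):
    """Map a Terminal probability to a class name (threshold is assumed)."""
    return "Terminal" if p_terminal >= threshold else "Non-Terminal"
```

In this sketch, early fusion merges information before classification, whereas late and average fusion merge the decisions of two separately trained models.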

Abstract

Transition Relevance Places are defined as the end of an utterance where the interlocutor may take the floor without interrupting the current speaker, i.e., a place where the turn is terminal. Analyzing turn terminality is useful for studying the dynamics of turn-taking in spontaneous conversations. This paper presents an automatic classification of spoken utterances as Terminal or Non-Terminal in multi-speaker settings. We compare audio, text, and fusions of both approaches on a French corpus of TV and Radio extracts annotated with turn-terminality information at each speaker change. Our models are based on pre-trained self-supervised representations. We report results for different fusion strategies and varying context sizes. This study also examines performance variability by analyzing the differences in results across multiple training runs with random initialization. The measured accuracy would allow the use of these models for large-scale analysis of turn-taking.
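One common way to summarize variability across training runs is a mean accuracy with a confidence interval. The sketch below assumes a normal approximation (z = 1.96 for a 95% interval); the abstract does not state how the paper computes its intervals, so the function name and the method are illustrative assumptions only.

```python
import math
import statistics


def mean_and_ci(accuracies, z=1.96):
    """Return (mean, (low, high)) for accuracies from repeated training runs.

    Uses the normal approximation mean +/- z * s / sqrt(n), where s is the
    sample standard deviation. This is an assumed method, not the paper's.
    """
    n = len(accuracies)
    if n < 2:
        raise ValueError("need at least two runs to estimate variability")
    mean = statistics.mean(accuracies)
    half_width = z * statistics.stdev(accuracies) / math.sqrt(n)
    return mean, (mean - half_width, mean + half_width)
```

With only a handful of runs, a Student-t quantile in place of z would give a wider and more appropriate interval.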

Paper Structure (11 sections, 1 equation, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Single modality model architecture
  • Figure 2: Early fusion model architecture
  • Figure 3: Mean accuracies for each model's Test settings (lines) with their confidence intervals, for each Train setting (subplots), for the different model architectures (x-axis).
  • Figure 4: Mean accuracies as a function of the duration of the testing sample for the three different training settings.