Detecting the terminality of speech-turn boundary for spoken interactions in French TV and Radio content
Rémi Uro, Marie Tahon, David Doukhan, Antoine Laurent, Albert Rilliard
TL;DR
The paper addresses the automatic detection of Terminal turns, i.e. turns that end at a Transition Relevance Place (TRP), in spontaneous French broadcast dialogues, with the aim of enabling large-scale turn-taking analysis. It uses multimodal models that fuse audio (wav2vec2-base) and text (FlauBERT) representations, and compares Audio Only, Text Only, and three fusion strategies (Early, Late, Average) on a French broadcast corpus. The study reports mean accuracies above 85% across configurations, with Audio Only and Early Fusion performing best, and shows that fully automatic 3-second window inputs still reach over 90% accuracy, which suggests robustness to automatic preprocessing. By releasing the code and the dataset, the work supports reproducible, scalable analysis of turn-taking in media, with potential applications in human–machine interaction and sociolinguistic research.
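To make the fusion vocabulary concrete, here is a minimal sketch of two of the strategies named above, under one common reading of the terms: Early Fusion concatenates the audio and text embeddings before a single classifier, and Average Fusion averages the class probabilities produced by the two single-modality classifiers. The function names, the two-class label order, and the toy logits are illustrative assumptions, not the authors' implementation; Late Fusion, which typically combines per-modality outputs with a learned layer, is not shown.

```python
import math
from typing import List, Sequence

# Hypothetical label order, chosen for this illustration only.
LABELS = ("Non-Terminal", "Terminal")


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw classifier scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_fusion(audio_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Early fusion (assumed form): concatenate both embeddings into one
    vector that a single joint classifier would consume."""
    return list(audio_emb) + list(text_emb)


def average_fusion(audio_probs: Sequence[float], text_probs: Sequence[float]) -> List[float]:
    """Average fusion (assumed form): mean of the per-modality class probabilities."""
    return [(a + t) / 2 for a, t in zip(audio_probs, text_probs)]


def predict(probs: Sequence[float]) -> str:
    """Return the label with the highest probability."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return LABELS[best]


# Toy logits (not taken from the paper): the audio model leans Non-Terminal,
# the text model leans Terminal.
audio_probs = softmax([2.0, 0.0])
text_probs = softmax([0.0, 1.0])
fused = average_fusion(audio_probs, text_probs)
decision = predict(fused)
```

In this toy case the audio model is the more confident of the two, so the averaged probabilities follow its decision; a learned fusion layer could weight the modalities differently.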
Abstract
Transition Relevance Places are defined as the ends of utterances where the interlocutor may take the floor without interrupting the current speaker, i.e., places where the turn is terminal. Analyzing turn terminality is useful for studying the dynamics of turn-taking in spontaneous conversations. This paper presents an automatic classification of spoken utterances as Terminal or Non-Terminal in multi-speaker settings. We compare audio-based models, text-based models, and fusions of the two on a French corpus of TV and radio extracts annotated with turn-terminality information at each speaker change. Our models are based on pre-trained self-supervised representations. We report results for different fusion strategies and varying context sizes. This study also examines performance variability by analyzing how results differ across multiple training runs with random initialization. The measured accuracy is high enough to allow the use of these models for large-scale analysis of turn-taking.
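The variability analysis mentioned above amounts to training the same configuration several times with different random seeds and summarizing the spread of the resulting scores. A minimal sketch, assuming one accuracy value per run; the helper name `summarize_runs` and the toy values are hypothetical and are not results from the paper.

```python
from statistics import mean, stdev
from typing import Dict, Sequence


def summarize_runs(accuracies: Sequence[float]) -> Dict[str, float]:
    """Summarize accuracy over repeated training runs of one configuration."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate variability")
    return {
        "mean": mean(accuracies),
        "std": stdev(accuracies),  # sample standard deviation (n - 1)
        "min": min(accuracies),
        "max": max(accuracies),
    }


# Placeholder accuracies chosen for easy arithmetic, not measured values.
toy_runs = [0.5, 0.75, 1.0]
summary = summarize_runs(toy_runs)
```

Reporting the mean together with the standard deviation (or the min and max) makes it easier to judge whether a gap between two configurations exceeds the run-to-run noise.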
