Large Language Models Know What To Say But Not When To Speak

Muhammad Umair, Vasanth Sarathy, JP de Ruiter

TL;DR

This paper introduces a novel dataset of participant-labeled within-turn Transition Relevance Places (TRPs) and uses it to show that current LLMs are limited in modeling unscripted spoken interactions, highlighting areas for improvement and paving the way for more naturalistic dialogue systems.

Abstract

Turn-taking is a fundamental mechanism in human communication that ensures smooth and coherent verbal interactions. Recent advances in Large Language Models (LLMs) have motivated their use in improving the turn-taking capabilities of Spoken Dialogue Systems (SDS), such as their ability to respond at appropriate times. However, existing models often struggle to predict opportunities for speaking -- called Transition Relevance Places (TRPs) -- in natural, unscripted conversations, focusing only on turn-final TRPs and not within-turn TRPs. To address these limitations, we introduce a novel dataset of participant-labeled within-turn TRPs and use it to evaluate the performance of state-of-the-art LLMs in predicting opportunities for speaking. Our experiments reveal the current limitations of LLMs in modeling unscripted spoken interactions, highlighting areas for improvement and paving the way for more naturalistic dialogue systems.

Paper Structure

This paper contains 14 sections, 4 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Participants listened to a stimulus ($S$) and produced auditory responses ($R$) to indicate their perception of TRPs. Each word in the stimulus ($w_{1},t_{w_{1}}^s,t_{w_{1}}^e$) and the response ($\tilde{w}_{1},t_{\tilde{w}_{1}}^s,t_{\tilde{w}_{1}}^e$) has a start and end time. Intervals are between adjacent words ($I_{pq}$).
  • Figure 2: Distribution of participant responses, the times at which participants agreed a TRP occurred, and model predictions of TRPs for a single stimulus $S$. The dotted lines indicate that each participant-agreed TRP has some associated variance. The responses are binned between the temporal midpoints of words (see Section \ref{sec:task}).
  • Figure 3: Example of participant response proportions and corresponding model predictions in each interval of a sample stimulus $S$. In this example, $\tau = 0.3$, which means that there is one interval in which participants agree that a TRP has occurred. Due to variance in human indications of TRP locations, we may consider a prediction correct if it falls within some window of the participant-agreed TRP.
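
The binning and thresholding described in the captions of Figures 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the strict `>` comparison against $\tau$, the window measured in whole intervals, and the toy timings are all assumptions made for illustration; the paper's own definitions may differ.

```python
from bisect import bisect_right
from typing import List, Tuple


def word_midpoints(words: List[Tuple[float, float]]) -> List[float]:
    """Temporal midpoint of each stimulus word, given (start, end) times."""
    return [(start + end) / 2 for start, end in words]


def bin_responses(response_times: List[float], midpoints: List[float]) -> List[int]:
    """Count responses in each interval between adjacent word midpoints.

    Responses before the first or after the last midpoint are ignored
    (an assumption of this sketch).
    """
    counts = [0] * (len(midpoints) - 1)
    for t in response_times:
        i = bisect_right(midpoints, t) - 1
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts


def agreed_trp_intervals(proportions: List[float], tau: float) -> List[int]:
    """Intervals whose response proportion exceeds tau (assumed strict)."""
    return [i for i, p in enumerate(proportions) if p > tau]


def within_window(predicted: int, agreed: List[int], window: int) -> bool:
    """True if a predicted interval lies within `window` intervals of any agreed TRP."""
    return any(abs(predicted - a) <= window for a in agreed)


# Toy data (hypothetical): four words, five participants, one response each.
words = [(0.0, 1.0), (1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]
responses = [1.0, 1.6, 1.8, 2.0, 3.0]
n_participants = 5

counts = bin_responses(responses, word_midpoints(words))
proportions = [c / n_participants for c in counts]
agreed = agreed_trp_intervals(proportions, tau=0.3)
```

With these toy values, only the middle interval passes the threshold, which mirrors the single participant-agreed TRP in the Figure 3 example. A model prediction one interval away would still count as correct under `within_window(..., window=1)`.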

Theorems & Definitions (1)

  • Definition 4.1: Within-turn TRP Prediction