Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics

Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, Shinji Watanabe

TL;DR

This work tackles the problem of evaluating turn-taking in audio foundation models, arguing that natural conversations require precise timing of speaking, backchannels, interruptions, and yield cues beyond traditional ASR/TTS metrics. It introduces a timing-centric evaluation protocol powered by a judge turn-taking model trained on human–human data, and defines metrics that assess whether an AI system speaks up, backchannels, interrupts, and signals when the user can continue, both when the AI is listening and when it is speaking. The authors train a causal predictor using a 30-second context window and 40-millisecond chunks, and apply it to user-study data from Moshi and a cascaded system, plus broader benchmarks on Switchboard. They also evaluate multiple audio FMs on their ability to understand and predict turn-taking events, finding notable gaps (e.g., aggressive interruptions by some systems and limited backchanneling) and showing that even strong models like GPT-4o differ in turn-taking behavior. The paper provides an open-source evaluation platform and highlights practical implications for building more natural, interactive conversational AI systems. The work advances the field by shifting from corpus-level turn-taking statistics to timing-aware, judge-based evaluation, enabling more reliable comparison and improvement of audio FMs in real-time dialogue settings.
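To make the timing figures above concrete, here is a minimal sketch of how a 30-second causal context over 40-millisecond chunks could be indexed. Only the two durations come from the summary; the function name `causal_windows` and the choice to index by chunk rather than by audio sample are illustrative assumptions, not the authors' implementation.

```python
from typing import Iterator, Tuple

CHUNK_MS = 40    # chunk duration stated in the TL;DR
CONTEXT_S = 30   # context window stated in the TL;DR

# Number of 40 ms chunks that fit in a 30 s context: 30000 / 40 = 750.
CHUNKS_PER_CONTEXT = CONTEXT_S * 1000 // CHUNK_MS


def causal_windows(
    num_chunks: int, context: int = CHUNKS_PER_CONTEXT
) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) chunk indices, end exclusive, one per prediction step.

    Hypothetical helper: the window for step t covers only chunks up to and
    including t, so a predictor fed these windows never sees future audio.
    """
    for t in range(num_chunks):
        end = t + 1
        start = max(0, end - context)  # clip at the start of the recording
        yield start, end


if __name__ == "__main__":
    windows = list(causal_windows(1000))
    print(CHUNKS_PER_CONTEXT)  # chunks per 30 s context
    print(windows[0], windows[-1])
```

Under these assumptions a prediction is made every 40 ms, and once the conversation is longer than 30 seconds each prediction sees exactly 750 chunks of past audio.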

Abstract

The recent wave of audio foundation models (FMs) could provide new capabilities for conversational modeling. However, there have been limited efforts to evaluate these audio FMs comprehensively on their ability to have natural and interactive conversations. To engage in meaningful conversation with the end user, we would want the FMs to additionally perform a fluent succession of turns without too much overlapping speech or long stretches of silence. Inspired by this, we ask whether the recently proposed audio FMs can understand, predict, and perform turn-taking events. To answer this, we propose a novel evaluation protocol that can assess a spoken dialogue system's turn-taking capabilities using a supervised model as a judge that has been trained to predict turn-taking events in human-human conversations. Using this protocol, we present the first comprehensive user study that evaluates existing spoken dialogue systems on their ability to perform turn-taking events and reveals many interesting insights: these systems sometimes do not understand when to speak up, can interrupt too aggressively, and rarely backchannel. We further evaluate multiple open-source and proprietary audio FMs accessible through APIs on carefully curated test benchmarks from Switchboard to measure their ability to understand and predict turn-taking events, and we identify significant room for improvement. We will open source our evaluation platform to promote the development of advanced conversational AI systems.

Paper Structure

This paper contains 28 sections, 10 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: Overview of turn-taking events in human-human conversation
  • Figure 2: Results of audio foundation models on engaging in conversation with humans, based on corpus-level statistics proposed in prior work [Dialog_GSLM].
  • Figure 3: The results show the consistency of the AI dialogue system's turn-taking decisions with judge labels across our proposed metrics. The first 3 graphs correspond to when the AI system is the listener and the remaining 2 graphs correspond to when the AI system is the speaker. Additionally, 95% confidence intervals are provided for the AI system on all metrics (also in Appendix Tab. \ref{tab:metric-confidence}). For each graph, the first two bars represent the consistency of our computed judge labels with human relevance judgments obtained from both an in-domain and an out-of-domain spoken dialogue corpus.
  • Figure 4: Confusion Matrix showing the performance of the turn-taking decisions $L^\text{dialogue}$ made by the AI systems using the supervised turn-taking model as judge (i.e. $L^\text{gen}$ as the ground truth). The numbers in the confusion matrix represent percentages.
  • Figure 5: Screenshot of the Moshi demo as shown to the participants during the user study
  • ...and 2 more figures
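
The captions for Figures 3 and 4 describe comparing an AI system's turn-taking decisions against labels from the judge model and reporting the result as percentages. A minimal sketch of that kind of comparison follows. The label names (`"speak"`, `"wait"`), the function names, and the choice to normalize over all decisions (rather than per row) are assumptions made for illustration; the captions do not specify them.

```python
from collections import Counter
from typing import Dict, List, Tuple


def confusion_percentages(
    judge: List[str], system: List[str]
) -> Dict[Tuple[str, str], float]:
    """Return {(judge_label, system_label): percent of all decisions}.

    Assumption: percentages are normalized over the total number of
    decisions; a per-row normalization would be an equally plausible reading.
    """
    if len(judge) != len(system):
        raise ValueError("judge and system label lists must have equal length")
    if not judge:
        return {}
    counts = Counter(zip(judge, system))
    total = len(judge)
    return {pair: 100.0 * n / total for pair, n in counts.items()}


def agreement_percent(judge: List[str], system: List[str]) -> float:
    """Percent of decisions where the system matches the judge label."""
    if len(judge) != len(system) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == s for j, s in zip(judge, system))
    return 100.0 * matches / len(judge)


if __name__ == "__main__":
    # Hypothetical labels, not data from the paper.
    judge_labels = ["speak", "wait", "wait", "speak"]
    system_labels = ["speak", "wait", "speak", "speak"]
    print(confusion_percentages(judge_labels, system_labels))
    print(agreement_percent(judge_labels, system_labels))
```

In this toy input the system speaks once where the judge label says to wait, which shows up as an off-diagonal cell and lowers the agreement score.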