Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction

Sam O'Connor Russell, Naomi Harte

TL;DR

This paper investigates whether visual cues enhance predictive turn-taking in two-party interactions and introduces MM-VAP, a transformer-based multimodal predictive turn-taking model that fuses speech with facial expression, gaze, and head pose. Using the Candor videoconferencing corpus and ASR-aligned transcripts, MM-VAP outperforms the state-of-the-art audio-only VAP model across hold/shift predictions, with notable gains in the F1 score for shifts. An ablation study reveals facial action units as the strongest visual contributor, supporting the hypothesis that non-verbal cues are vital for turn-taking when interlocutors can see each other. The work demonstrates robust improvements across a range of silence durations and provides code for broader adoption, highlighting the practical impact of multimodal cues in real-time human-robot interaction.

Abstract

Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM which combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work, which aggregates all holds and shifts, we group by duration of silence between turns. This reveals that through the inclusion of visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
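
To make the duration-grouped evaluation concrete, the following is a minimal sketch of how hold/shift accuracy could be reported per silence-duration bin instead of as a single aggregate. It is not the authors' released code: the bin edges in `BIN_EDGES`, the event tuple layout, and the helper names `label_transition`, `bin_index` and `accuracy_by_silence` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical bin edges in seconds; the paper's actual grouping may differ.
BIN_EDGES = [0.0, 0.25, 0.5, 1.0, float("inf")]


def label_transition(prev_speaker, next_speaker):
    """A hold keeps the same speaker after the silence; a shift changes speaker."""
    return "hold" if prev_speaker == next_speaker else "shift"


def bin_index(silence_s, edges=BIN_EDGES):
    """Return the index of the half-open bin [edges[i], edges[i+1]) containing silence_s."""
    for i in range(len(edges) - 1):
        if edges[i] <= silence_s < edges[i + 1]:
            return i
    raise ValueError(f"silence duration {silence_s} is outside the bin range")


def accuracy_by_silence(events, edges=BIN_EDGES):
    """Compute hold/shift accuracy separately for each silence-duration bin.

    events: iterable of (silence_seconds, true_label, predicted_label) tuples.
    Returns a dict mapping bin index to accuracy; empty bins are omitted.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for silence_s, true_label, pred_label in events:
        b = bin_index(silence_s, edges)
        total[b] += 1
        correct[b] += int(true_label == pred_label)
    return {b: correct[b] / total[b] for b in sorted(total)}


if __name__ == "__main__":
    # Toy events, invented purely to exercise the functions.
    toy_events = [
        (0.10, label_transition("A", "A"), "hold"),
        (0.20, label_transition("A", "B"), "hold"),
        (0.60, label_transition("B", "A"), "shift"),
    ]
    print(accuracy_by_silence(toy_events))
```

Reporting one accuracy per bin makes it visible whether a model's advantage holds for both short and long silences, which a single aggregated score would hide.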

Paper Structure

This paper contains 34 sections, 5 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Still from Candor with OpenFace features shown for the left participant
  • Figure 2: Schematic depicting a shift between speakers
  • Figure 3: The VAP training objective introduced by Ekstedt and Skantze (Interspeech 2022), which captures speaking activity in the next 2 seconds in a two-party interaction.
  • Figure 4: Schematic of our transformer-based multimodal predictive turn-taking model (late fusion version), incorporating audio and video from both speakers.
  • Figure 5: Median FAU intensity in Candor during random speech and silence, and before holds and shifts.
  • ...and 1 more figure