Visual Cues Enhance Predictive Turn-Taking for Two-Party Human Interaction
Sam O'Connor Russell, Naomi Harte
TL;DR
This paper investigates whether visual cues improve predictive turn-taking in two-party interactions. It introduces MM-VAP, a transformer-based multimodal predictive turn-taking model that fuses speech with facial expression, gaze, and head pose. On the Candor videoconferencing corpus with ASR-aligned transcripts, MM-VAP outperforms the state-of-the-art audio-only VAP model on hold/shift prediction, with notable gains in F1 score for shifts. An ablation study identifies facial action units as the strongest visual contributor, supporting the hypothesis that non-verbal cues are vital for turn-taking when interlocutors can see each other. The improvements hold across a range of silence durations, and the released code supports broader adoption of multimodal cues in real-time human-robot interaction.
Abstract
Turn-taking is richly multimodal. Predictive turn-taking models (PTTMs) facilitate naturalistic human-robot interaction, yet most rely solely on speech. We introduce MM-VAP, a multimodal PTTM that combines speech with visual cues including facial expression, head pose and gaze. We find that it outperforms the state-of-the-art audio-only model in videoconferencing interactions (84% vs. 79% hold/shift prediction accuracy). Unlike prior work, which aggregates all holds and shifts, we group them by the duration of silence between turns. This grouping reveals that, by including visual features, MM-VAP outperforms a state-of-the-art audio-only turn-taking model across all durations of speaker transitions. We conduct a detailed ablation study, which reveals that facial expression features contribute the most to model performance. Thus, our working hypothesis is that when interlocutors can see one another, visual cues are vital for turn-taking and must therefore be included for accurate turn-taking prediction. We additionally validate the suitability of automatic speech alignment for PTTM training using telephone speech. This work represents the first comprehensive analysis of multimodal PTTMs. We discuss implications for future work and make all code publicly available.
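
To illustrate what grouping holds and shifts by silence duration can look like in practice, the following is a minimal sketch and not the authors' evaluation code. The function name `accuracy_by_silence_duration`, the event tuple layout, and the bin edges (0.25 s, 0.5 s, 1.0 s) are all assumptions chosen for illustration; the paper's actual duration groups may differ.

```python
from collections import defaultdict


def accuracy_by_silence_duration(events, bin_edges=(0.25, 0.5, 1.0)):
    """Compute hold/shift prediction accuracy per silence-duration bin.

    events: iterable of (silence_duration_s, true_label, predicted_label),
            where labels are "hold" or "shift".
    bin_edges: ascending edges in seconds (hypothetical values). Bin 0 holds
               durations below the first edge; the last bin is open-ended.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for duration, truth, predicted in events:
        # Count how many edges the duration has reached; that count is the bin index.
        bin_index = sum(duration >= edge for edge in bin_edges)
        totals[bin_index] += 1
        correct[bin_index] += int(truth == predicted)
    # Report accuracy only for bins that contain at least one event.
    return {b: correct[b] / totals[b] for b in sorted(totals)}


if __name__ == "__main__":
    toy_events = [
        (0.10, "hold", "hold"),
        (0.20, "shift", "hold"),
        (0.60, "shift", "shift"),
        (1.50, "hold", "hold"),
    ]
    for bin_index, acc in accuracy_by_silence_duration(toy_events).items():
        print(f"bin {bin_index}: accuracy {acc:.2f}")
```

Reporting one score per bin, rather than a single aggregate, makes it visible whether a model's advantage is concentrated in short or long silences or holds across all of them.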
