Visual Cues Support Robust Turn-taking Prediction in Noise
Sam O'Connor Russell, Naomi Harte
TL;DR
The paper investigates how predictive turn-taking models (PTTMs) perform under background noise and whether visual cues improve robustness compared with audio-only baselines. The authors evaluate two transformer-based models on the Candor corpus: an audio-only VAP and a multimodal MM-VAP that adds OpenFace visual features. Clean-condition accuracy is 84% vs 80%, but when the models are trained only on clean data, accuracy falls to around 50% in music and speech noise. Training with noisy data enables MM-VAP to reach up to 72% accuracy in music noise at 10 dB SNR, an effective SNR gain of about +10 dB over the audio-only model in the same condition. Generalisation to unseen noise types is limited, however. Alignment accuracy also strongly influences training outcomes: ASR-derived transcripts obtained in noise significantly reduce performance. Overall, the work shows that multimodal cues can yield more robust PTTMs in noise when trained with representative noisy data, marking progress toward robust human-robot turn-taking.
Abstract
Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in the types of noise likely to be encountered once deployed. Our analyses reveal that PTTMs are highly sensitive to noise: hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which includes visual features, to better exploit visual cues, reaching 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make code publicly available for future research.
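
To make a condition such as "10 dB music noise" concrete, the sketch below scales a noise signal so that the speech-to-noise power ratio matches a target SNR before the two are added. It uses the standard definition SNR = 10 log10(P_speech / P_noise); the function names `mix_at_snr` and `measure_snr_db`, the synthetic sine "speech", and the Gaussian "noise" are illustrative assumptions, not the authors' released pipeline.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def measure_snr_db(speech, noise):
    """SNR in dB between a speech signal and a noise signal."""
    return 10 * math.log10(power(speech) / power(noise))


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` to the target SNR relative to `speech`, then add them.

    Hypothetical helper: returns (mixture, scaled_noise).
    """
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = power(speech)
    p_noise = power(noise)
    if p_noise == 0:
        raise ValueError("noise is silent")
    # Required noise power is P_speech / 10^(SNR/10); the amplitude gain
    # is the square root of the ratio of required to current noise power.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


# Stand-in signals (assumptions): one second at 16 kHz.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]

mixture, scaled_noise = mix_at_snr(speech, noise, snr_db=10.0)
```

Under this definition, an effective gain of about +10 dB means the multimodal model tolerates noise with roughly ten times the power for a comparable accuracy.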
