Visual Cues Support Robust Turn-taking Prediction in Noise

Sam O'Connor Russell, Naomi Harte

TL;DR

The paper investigates how predictive turn-taking models (PTTMs) perform under background noise and whether visual cues improve robustness compared with audio-only baselines. Two transformer-based models, an audio-only VAP and a multimodal MM-VAP with OpenFace visual features, are evaluated on the Candor corpus. In clean conditions the authors report 84% vs 80% accuracy, but when the models are trained only on clean data, accuracy falls to around $50\%$ in music and speech noise. Training with noisy data enables MM-VAP to reach up to 72% accuracy in music at $10\,\text{dB}$ SNR, an effective SNR gain of about $+10\,\text{dB}$ over the audio-only model in the same condition. Generalisation to unseen noise types is limited, and alignment accuracy strongly influences training outcomes: ASR-derived transcripts obtained in noise significantly reduce performance. Overall, the work shows that multimodal cues can yield more robust PTTMs in noise when trained with representative noisy data, marking progress toward robust human-robot turn-taking.
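To make the SNR figures above concrete, the following is a minimal sketch of how a noise recording can be scaled so that a speech-plus-noise mixture has a chosen SNR. It uses the standard power-ratio definition, $\text{SNR}_{\text{dB}} = 10 \log_{10}(P_{\text{speech}} / P_{\text{noise}})$. The summary does not describe the authors' mixing procedure, so the function names (`mix_at_snr`, `measure_snr_db`), the looping of short noise clips, and the synthetic signals are all assumptions made for illustration.

```python
import math
import random


def mean_power(signal):
    """Mean squared amplitude of a sequence of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, target_snr_db):
    """Return (mixture, scaled_noise) whose speech-to-noise power ratio is target_snr_db.

    Hypothetical helper: not taken from the paper's code.
    """
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")
    # Loop the noise if it is shorter than the speech, then trim to length.
    noise = [noise[i % len(noise)] for i in range(len(speech))]
    p_speech, p_noise = mean_power(speech), mean_power(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must both have non-zero power")
    # Solve 10*log10(p_speech / (gain**2 * p_noise)) = target_snr_db for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (target_snr_db / 10)))
    scaled = [gain * n for n in noise]
    mixture = [s + n for s, n in zip(speech, scaled)]
    return mixture, scaled


def measure_snr_db(speech, noise):
    """SNR in dB between a speech signal and the noise added to it."""
    return 10 * math.log10(mean_power(speech) / mean_power(noise))


# Synthetic stand-ins: a 220 Hz tone as "speech", Gaussian samples as "noise".
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(4000)]

mixture, scaled = mix_at_snr(speech, noise, 10.0)
print(round(measure_snr_db(speech, scaled), 2))
```

Under this definition, a $+10\,\text{dB}$ effective gain means the multimodal model matches the audio-only model's accuracy when the noise is ten times more powerful relative to the speech.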

Abstract

Accurate predictive turn-taking models (PTTMs) are essential for naturalistic human-robot interaction. However, little is known about their performance in noise. This study therefore explores PTTM performance in types of noise likely to be encountered once deployed. Our analyses reveal PTTMs are highly sensitive to noise. Hold/shift accuracy drops from 84% in clean speech to just 52% in 10 dB music noise. Training with noisy data enables a multimodal PTTM, which includes visual features to better exploit visual cues, with 72% accuracy in 10 dB music noise. The multimodal PTTM outperforms the audio-only PTTM across all noise types and SNRs, highlighting its ability to exploit visual cues; however, this does not always generalise to new types of noise. Analysis also reveals that successful training relies on accurate transcription, limiting the use of ASR-derived transcriptions to clean conditions. We make code publicly available for future research.

Paper Structure

This paper contains 14 sections, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Average hold/shift prediction accuracy of VAP (audio-only) and MM-VAP (audio+video) models in each noise type. The average is taken across the -10 dB to +10 dB SNR range.
  • Figure 2: Average hold/shift prediction accuracy of VAP (audio-only) and MM-VAP (audio+video) models at each SNR. An average accuracy in speech, music and babble noise is shown.
  • Figure 3: Model output during a shift in the Candor corpus (unseen during training). Speaker 0 (grey): "So uhm would you feel comfortable telling me about what you do for a living?", Speaker 1 (red): "Uhm yeah well for the most part I test and tweak algorithms". VAP and MM-VAP model output is shown in clean speech (B) and 0 dB babble speech (C) and (D).