Yeah, Un, Oh: Continuous and Real-time Backchannel Prediction with Fine-tuning of Voice Activity Projection

Koji Inoue, Divesh Lala, Gabriel Skantze, Tatsuya Kawahara

TL;DR

This work tackles real-time, continuous backchannel prediction by fine-tuning a Transformer-based Voice Activity Projection (VAP) model on backchannel data after pre-training it on general dialogue. It defines two tasks, timing prediction and timing-with-type prediction, and shows that a two-stage, multi-task training regime performs best on unbalanced, real-world data. The model improves on baselines in both timing and type prediction, runs in real time on a CPU, and its analysis offers insight into the role of prosody. Integrated into a CG agent, the approach shows practical potential for more natural, responsive spoken dialogue systems, although its language coverage is limited and cross-language evaluation is still needed.

Abstract

In human conversations, short backchannel utterances such as "yeah" and "oh" play a crucial role in facilitating smooth and engaging dialogue. These backchannels signal attentiveness and understanding without interrupting the speaker, making their accurate prediction essential for creating more natural conversational agents. This paper proposes a novel method for real-time, continuous backchannel prediction using a fine-tuned Voice Activity Projection (VAP) model. While existing approaches have relied on turn-based or artificially balanced datasets, our approach predicts both the timing and type of backchannels in a continuous and frame-wise manner on unbalanced, real-world datasets. We first pre-train the VAP model on a general dialogue corpus to capture conversational dynamics and then fine-tune it on a specialized dataset focused on backchannel behavior. Experimental results demonstrate that our model outperforms baseline methods in both timing and type prediction tasks, achieving robust performance in real-time environments. This research offers a promising step toward more responsive and human-like dialogue systems, with implications for interactive spoken dialogue applications such as virtual assistants and robots.
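To make the idea of frame-wise prediction on unbalanced data concrete, the following is a minimal sketch of how backchannel onset times could be turned into per-frame binary labels. It is an illustration only, not the authors' labeling procedure: the function name `frame_labels`, the 10 Hz frame rate, and the 0.5 s window before each onset are all assumptions chosen for readability.

```python
from typing import List


def frame_labels(onsets_sec: List[float], duration_sec: float,
                 frame_hz: int = 10, lead_sec: float = 0.5) -> List[int]:
    """Return one 0/1 label per frame.

    A frame is labeled 1 if it falls in the window [onset - lead_sec, onset)
    before a backchannel onset, and 0 otherwise. The frame rate and window
    length are hypothetical values, not taken from the paper.
    """
    n_frames = int(round(duration_sec * frame_hz))
    labels = [0] * n_frames
    for onset in onsets_sec:
        start = max(0, int(round((onset - lead_sec) * frame_hz)))
        end = min(n_frames, int(round(onset * frame_hz)))
        for i in range(start, end):
            labels[i] = 1
    return labels


# Toy example: two backchannels in a 10-second stretch of dialogue.
labels = frame_labels([2.0, 7.5], duration_sec=10.0)
positive_ratio = sum(labels) / len(labels)
print(f"{sum(labels)} positive frames out of {len(labels)} "
      f"({positive_ratio:.0%})")
```

Even in this toy case only a small fraction of frames are positive, which is the kind of class imbalance a continuous, frame-wise predictor has to handle when the data are not artificially balanced.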

Paper Structure

This paper contains 18 sections, 2 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Conceptual diagram of continuous backchannel prediction
  • Figure 2: Setup for dialogue recording
  • Figure 3: Definition of positive (backchannel) and negative (non-backchannel) frames
  • Figure 4: Architecture of the VAP model
  • Figure 5: Definition of the VAP state
  • ...and 5 more figures