Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems

Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara

TL;DR

This work tackles turn-taking in multi-party dialogue by extending Voice Activity Projection (VAP) to triadic conversations. It introduces a triadic VAP framework using a CPC encoder and multi-channel cross-attention to jointly predict future voice activity for three speakers, trained on the TEIDAN Japanese triadic corpus with both spontaneous and attentive listening data. The results show that triadic VAP models outperform baselines across configurations, with the best performance achieved when combining spontaneous and attentive data, though prediction accuracy is influenced by conversation type and overlap. The study highlights the potential for integrating triadic VAP into spoken dialogue systems, while acknowledging limitations of acoustic-only cues and the need for multi-modal information and larger group-size evaluation for robust real-world deployment.

Abstract

Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity of each speaker using only acoustic data. This is the first study to extend VAP to triadic conversation. We trained multiple models on a Japanese triadic dataset in which participants discussed a variety of topics. We found that VAP trained on triadic conversation outperformed the baseline for all models, but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.

Paper Structure

This paper contains 14 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Discretized bins for triadic VAP model. The dashed vertical line demarcates the bins and the segments indicate speaker voice activity.
  • Figure 2: Example session setup from TEIDAN corpus.
  • Figure 3: Triadic VAP model architecture.
  • Figure 4: The top three plots show the ground truth waveforms (amplitude) for each of the three speakers, while the bottom three depict the model's predicted voice activity probabilities (0 to 1). Colors denote speaker: blue (1), orange (2), and green (3).
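
The discretized bins described for Figure 1 can be illustrated with a minimal sketch. It assumes each speaker's future voice activity is a list of 0/1 frames, that a bin counts as active when at least half of its frames are voiced, and that the three speakers' bins are packed into a single integer state. The function names, bin edges, threshold, and packing order are all hypothetical choices for illustration; the summary above does not specify the values the authors used.

```python
from typing import List, Sequence


def discretize_future_activity(
    future_va: Sequence[Sequence[int]],
    bin_edges: Sequence[int],
    threshold: float = 0.5,  # assumed activity threshold, not from the source
) -> List[List[int]]:
    """Reduce per-frame 0/1 voice activity to one 0/1 label per bin, per speaker.

    future_va: one sequence of 0/1 frames per speaker (three for a triad).
    bin_edges: frame indices delimiting the bins, e.g. [0, 2, 4] gives two bins.
    """
    labels = []
    for speaker_frames in future_va:
        speaker_bins = []
        for start, end in zip(bin_edges[:-1], bin_edges[1:]):
            segment = speaker_frames[start:end]
            ratio = sum(segment) / len(segment)
            speaker_bins.append(1 if ratio >= threshold else 0)
        labels.append(speaker_bins)
    return labels


def bins_to_state_index(labels: Sequence[Sequence[int]]) -> int:
    """Pack all speakers' bin labels into one integer (speaker 1's first bin is the top bit)."""
    index = 0
    for speaker_bins in labels:
        for bit in speaker_bins:
            index = (index << 1) | bit
    return index


# Toy triad: speaker 1 talks early, speaker 2 takes over, speaker 3 stays silent.
toy_future = [
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
]
toy_labels = discretize_future_activity(toy_future, bin_edges=[0, 2, 4])
toy_state = bins_to_state_index(toy_labels)
```

Under this packing, three speakers with B bins each yield 2^(3B) possible states, so the number of bins directly controls the size of the prediction target.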