Multilingual Turn-taking Prediction Using Voice Activity Projection

Koji Inoue; Bing'er Jiang; Erik Ekstedt; Tatsuya Kawahara; Gabriel Skantze

TL;DR

The paper tackles multilingual turn-taking in spoken dialogue by extending Voice Activity Projection (VAP) with a cross-attention Transformer to predict near-future two-speaker activity across English, Mandarin, and Japanese. A multilingual VAP model trained on all three languages achieves performance comparable to monolingual models, addressing cross-language transfer issues observed with language-specific training. The authors show that the multilingual model implicitly learns language identification (achieving near-perfect accuracy) and that pitch cues influence turn-taking predictions more in Mandarin and Japanese than in English; CPC-based encoders outperform Massively Multilingual Speech (MMS) encoders in this setup. These findings support language-agnostic turn-taking prediction in multilingual dialogue systems and highlight the importance of encoder choice for multilingual audio representations.

Abstract

This paper investigates the application of voice activity projection (VAP), a predictive turn-taking model for spoken dialogue, to multilingual data encompassing English, Mandarin, and Japanese. The VAP model continuously predicts the upcoming voice activities of participants in dyadic dialogue, leveraging a cross-attention Transformer to capture the dynamic interplay between participants. The results show that a monolingual VAP model trained on one language does not make good predictions when applied to other languages. However, a multilingual model, trained on all three languages, demonstrates predictive performance on par with monolingual models across all languages. Further analyses show that the multilingual model has learned to discern the language of the input signal. We also analyze the sensitivity to pitch, a prosodic cue that is thought to be important for turn-taking. Finally, we compare two audio encoders: contrastive predictive coding (CPC) pre-trained on English, and a recent model based on multilingual wav2vec 2.0 (MMS).
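
The abstract describes VAP as predicting the upcoming voice activity of both participants. One way such a prediction target could be formed is sketched below: each speaker's future voice activity is split into a few bins, each bin is reduced to a single active/inactive bit, and the bits of both speakers are packed into one class index. The bin lengths, the 50 Hz frame rate, the "at least half voiced" rule, the bit ordering, and the function names are all assumptions made for illustration; this summary does not state them, and the sketch is not the authors' implementation.

```python
from typing import List, Sequence

# Assumed bin lengths in frames (hypothetical: 50 Hz frames, bins of 0.2, 0.4, 0.6, 0.8 s).
BIN_FRAMES = (10, 20, 30, 40)


def discretize(future_va: Sequence[int], bin_frames: Sequence[int] = BIN_FRAMES) -> List[int]:
    """Turn one speaker's future frame-level voice activity (0/1) into one bit per bin."""
    bits = []
    start = 0
    for n in bin_frames:
        window = future_va[start:start + n]
        # Assumption: a bin counts as active when at least half of its frames are voiced.
        bits.append(1 if 2 * sum(window) >= n else 0)
        start += n
    return bits


def vap_state(speaker_a: Sequence[int], speaker_b: Sequence[int]) -> int:
    """Pack both speakers' bin bits into a single class index in [0, 2 ** 8)."""
    bits = discretize(speaker_a) + discretize(speaker_b)
    index = 0
    for bit in bits:
        # Assumed ordering: speaker A's earliest bin is the most significant bit.
        index = (index << 1) | bit
    return index
```

With four bins per speaker, this encoding yields 256 possible classes, so a model trained on it can treat the prediction of near-future two-speaker activity as a classification problem over those classes.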

Paper Structure (20 sections, 3 equations, 6 figures, 6 tables)


Introduction
Voice Activity Projection (VAP)
Model Architecture
VAP State
Loss Function
Turn-taking Prediction Using VAP
Datasets
Switchboard (English)
HKUST Mandarin Telephone Speech
Travel Agency Task Dialogues (Japanese)
Differences Across Languages
Experiments
Cross-lingual Performance
Condition
Test Loss Performance
...and 5 more sections

Figures (6)

Figure 1: Architecture of the VAP model
Figure 2: Discretizing bins for the VAP model
Figure 3: Histogram of turn-shift gap in three languages
Figure 4: Histogram of turn-hold gap in three languages
Figure 5: Output example of multilingual VAP in three languages (Top: English, Middle: Mandarin, Bottom: Japanese) - Each graph consists of, from top to bottom, input waveforms of both participants, near future voiced probability ($p_{now}$), and future voiced probability ($p_{future}$) among participants.
...and 1 more figures
