Turn-taking and Backchannel Prediction with Acoustic and Large Language Model Fusion

Jinhan Wang; Long Chen; Aparna Khare; Anirudh Raju; Pranav Dheram; Di He; Minhua Wu; Andreas Stolcke; Venkatesh Ravichandran

TL;DR

The paper addresses predicting turn-taking and backchannel locations in spoken dialogue by fusing HuBERT-based acoustic representations with LLM embeddings (GPT-2 and RedPajama). It introduces a late-fusion framework with two configurations (Opt1 and Opt2) and a novel multi-task instruction fine-tuning scheme that leverages task descriptions and dialogue history. The model estimates $P(Y \mid X^A, X^L)$, where $X^A$ and $X^L$ denote the acoustic and language inputs and $Y \in \{\text{Continuing Speech}, \text{Backchannel}, \text{Turn-taking}\}$. On Switchboard, the fusion approaches outperform the single-modality baselines, and RedPajama with instruction tuning and dialogue history performs best (average AUC around 0.88). The results suggest that combining acoustic signals with LLM-based language understanding can support more natural, responsive human-agent dialogue.
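As a rough illustration of what late fusion over the three classes could look like, the sketch below concatenates an acoustic embedding with a language embedding, applies a single linear layer, and normalizes with a softmax to obtain $P(Y \mid X^A, X^L)$. This is a minimal sketch under stated assumptions, not the paper's Opt1 or Opt2 configuration: the function name `late_fusion_posterior`, the embedding sizes, and the randomly drawn weights are all hypothetical, and the toy vectors merely stand in for HuBERT and LLM outputs.

```python
import math
import random

LABELS = ("Continuing Speech", "Backchannel", "Turn-taking")


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion_posterior(acoustic_emb, language_emb, weights, bias):
    """Concatenate both embeddings, apply one linear layer, and normalize.

    Hypothetical helper: the paper's actual fusion layers are not reproduced here.
    """
    fused = list(acoustic_emb) + list(language_emb)
    logits = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return dict(zip(LABELS, softmax(logits)))


# Toy inputs with assumed dimensions (4 and 6); real embeddings are far larger.
rng = random.Random(0)
acoustic_emb = [rng.uniform(-1, 1) for _ in range(4)]   # stand-in for an acoustic embedding
language_emb = [rng.uniform(-1, 1) for _ in range(6)]   # stand-in for an LLM embedding
weights = [[rng.uniform(-0.5, 0.5) for _ in range(10)] for _ in LABELS]
bias = [0.0, 0.0, 0.0]

posterior = late_fusion_posterior(acoustic_emb, language_emb, weights, bias)
for label, prob in posterior.items():
    print(f"{label}: {prob:.3f}")
```

In a trained system the weights would be learned from labeled dialogue data; here they are random, so the printed probabilities carry no meaning beyond showing that the three class scores form a valid distribution.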

Abstract

We propose an approach for continuous prediction of turn-taking and backchanneling locations in spoken dialogue by fusing a neural acoustic model with a large language model (LLM). Experiments on the Switchboard human-human conversation dataset demonstrate that our approach consistently outperforms the baseline models with single modality. We also develop a novel multi-task instruction fine-tuning strategy to further benefit from LLM-encoded knowledge for understanding the tasks and conversational contexts, leading to additional improvements. Our approach demonstrates the potential of combined LLMs and acoustic models for a more natural and conversational interaction between humans and speech-enabled AI agents.

Paper Structure (15 sections, 4 equations, 4 figures, 2 tables)

Introduction
Proposed Method
Problem setup
Acoustic and language modeling
Fusion or joint training
Multi-task instruction fine-tuning
Experiments
Dataset
Training and evaluation scenarios
Experimental details
Results
Single modality versus fusion
Multi-task instruction fine-tuning
Instruction fine-tuning with dialogue history
Conclusions

Figures (4)

Figure 1: Schematic of combined acoustic and LLM modeling for turn-taking, backchannel and continuing speech prediction.
Figure 2: LLM-based multi-task instruction fine-tuning for turn-taking, backchannel, and continuing speech prediction.
Figure 3: ROC plots for turn-taking (left) and backchannel (right).
Figure 4: Left: backchannel score distribution for the positive and negative samples. Right: a sentence example with token-level backchannel score. The markers represent the ground-truth token labels.
