Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction

Hyunbae Jeon; Frederic Guintu; Rayvant Sahni

TL;DR

This work tackles turn-taking prediction by combining linguistic context from an LLM with acoustic cues from a voice activity projection (VAP) model in a multi-modal ensemble, Lla-VAP. Evaluated on the CCPE and ICC datasets, the approach shows that temporal modeling with an LSTM ensemble yields the strongest performance, especially for turn-final transition relevance places (TRPs), while within-turn predictions remain challenging. Prompt engineering significantly influences LLM decisions, and all models retain real-time capability, which makes deployment in naturalistic dialogue systems practical. The findings underscore the value of combining lexical and acoustic information to predict TRPs more robustly in diverse conversational contexts.

Abstract

Turn-taking prediction is the task of anticipating when the current speaker in a conversation will yield the turn so that another speaker can begin. This project expands on existing strategies for turn-taking prediction with a multi-modal ensemble approach that integrates large language models (LLMs) and voice activity projection (VAP) models. By combining the linguistic capabilities of LLMs with the temporal precision of VAP models, we aim to improve the accuracy and efficiency of identifying transition relevance places (TRPs) in both scripted and unscripted conversational scenarios. We evaluate our methods on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models while proposing a potentially more robust framework for enhanced prediction.
