Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction
Hyunbae Jeon, Frederic Guintu, Rayvant Sahni
TL;DR
This work tackles turn-taking prediction by combining linguistic context from a large language model (LLM) with acoustic cues from a voice activity projection (VAP) model in a multi-modal ensemble, Lla-VAP. Evaluated on the CCPE and ICC datasets, the approach shows that temporal modeling with an LSTM ensemble yields the strongest performance, especially for turn-final transition relevance places (TRPs), while within-turn predictions remain challenging. Prompt engineering significantly influences the LLM's decisions, and real-time capability is preserved across models, which makes deployment in naturalistic dialogue systems practical. The findings underscore the value of combining lexical and acoustic information to predict TRPs more robustly across diverse conversational contexts.
Abstract
Turn-taking prediction is the task of anticipating when the current speaker in a conversation will yield the turn so that another speaker can begin. This project expands on existing strategies for turn-taking prediction by employing a multi-modal ensemble approach that integrates large language models (LLMs) and voice activity projection (VAP) models. By combining the linguistic capabilities of LLMs with the temporal precision of VAP models, we aim to improve the accuracy and efficiency of identifying transition relevance places (TRPs) in both scripted and unscripted conversational scenarios. Our methods are evaluated on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models and proposing a potentially more robust framework for TRP prediction.
