Lla-VAP: LSTM Ensemble of Llama and VAP for Turn-Taking Prediction

Hyunbae Jeon, Frederic Guintu, Rayvant Sahni

TL;DR

This work tackles turn-taking prediction by integrating linguistic context from an LLM with acoustic cues from a VAP model in a multi-modal Lla-VAP ensemble. Evaluated on the CCPE and ICC datasets, the approach shows that temporal modeling with an LSTM ensemble yields the strongest performance, especially for turn-final transition relevance places (TRPs), while within-turn predictions remain challenging. Prompt engineering significantly influences LLM decisions, and real-time capability is preserved across models, enabling practical deployment in naturalistic dialogue systems. The findings underscore the value of combining lexical and acoustic information to predict TRPs more robustly in diverse conversational contexts.

Abstract

Turn-taking prediction is the task of anticipating when the current speaker in a conversation will yield the floor so that another speaker can begin. This project extends existing strategies for turn-taking prediction with a multi-modal ensemble approach that integrates large language models (LLMs) and voice activity projection (VAP) models. By combining the linguistic capabilities of LLMs with the temporal precision of VAP models, we aim to improve the accuracy and efficiency of identifying transition relevance places (TRPs) in both scripted and unscripted conversational scenarios. We evaluate our methods on the In-Conversation Corpus (ICC) and Coached Conversational Preference Elicitation (CCPE) datasets, highlighting the strengths and limitations of current models while proposing a potentially more robust framework for prediction.

Paper Structure

This paper contains 34 sections, 5 figures, 3 tables.

Figures (5)

  • Figure 1: LSTM ensemble training progression on the CCPE dataset. (Top-left) Training and validation loss, showing convergence. (Top-right) Balanced accuracy, reaching and holding at approximately 95% after epoch 5. (Bottom-left) Sensitivity and specificity, demonstrating balanced learning. (Bottom-right) Positive prediction ratio, approaching the true distribution (indicated by the dashed line).
  • Figure 2: VAP showing less stable predictions
  • Figure 3: Basic prompt inference showing less stable predictions
  • Figure 4: Enhanced prompt inference with improved temporal alignment
  • Figure 5: LSTM ensemble inference demonstrating more precise turn-taking predictions
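
The Figure 1 caption names four quantities: balanced accuracy, sensitivity, specificity, and positive prediction ratio. As a rough illustration of how they relate, the sketch below computes them from binary per-step labels using the standard definitions. It is not the authors' evaluation code. The function name `trp_metrics`, the 1/0 label encoding, and the reading of "positive prediction ratio" as the fraction of steps predicted positive are all assumptions made for this example.

```python
def trp_metrics(y_true, y_pred):
    """Compute binary classification metrics for TRP prediction.

    Assumed encoding (hypothetical): 1 = TRP, 0 = no TRP, one label per step.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal length")

    # Confusion-matrix counts.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    # Sensitivity: share of true TRPs that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: share of non-TRP steps correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Balanced accuracy averages the two, so a rare positive class
        # cannot be ignored without lowering the score.
        "balanced_accuracy": (sensitivity + specificity) / 2,
        # Assumed meaning: fraction of steps predicted as TRP.
        "positive_prediction_ratio": (tp + fp) / len(y_pred),
        # Fraction of steps that truly are TRPs (the reference line).
        "true_positive_ratio": (tp + fn) / len(y_true),
    }


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    y_true = [1, 0, 0, 1, 0, 0, 0, 1]
    y_pred = [1, 0, 1, 1, 0, 0, 0, 0]
    for name, value in trp_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.3f}")
```

Balanced accuracy is a natural choice when TRPs are much rarer than non-TRP steps, since plain accuracy would reward a model that never predicts a turn change.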