Beyond Words: Multimodal LLM Knows When to Speak
Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin
TL;DR
The paper tackles the problem of predicting when a conversational agent should speak, focusing on brief, mid-turn reactions. It introduces MM-When2Speak, a multimodal LLM-based system that fuses video, audio, and text through an Encoder-Adaptor-LLM architecture and produces time-aligned predictions with a sliding-window dense classification framework (with $\Delta t=10$ and $\delta=0.5$). Two curated datasets, Short-Clips and Full-Videos, enable fine-grained evaluation of reaction timing and turn-taking in dyadic conversations; the label set comprises seven reaction types plus full\_response and silence. Empirical results show that multimodal fusion and self-attention fusion outperform state-of-the-art unimodal and LLM-based baselines, with up to a $4\times$ improvement in response-timing accuracy over leading commercial LLMs and consistent gains across reaction types, although performance on full videos can be affected by class imbalance. The work demonstrates the practical value of grounding speak-time decisions in multimodal cues for more natural, engaging conversational AI. It also notes limitations in multi-party generalization and annotation noise, and proposes extensions that model both when to speak and what to say.
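The sliding-window dense classification idea can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that $\Delta t$ is the length of the context window and $\delta$ is the step between consecutive predictions, both in seconds; the summary above does not state the units or the exact roles of these parameters. The names `sliding_windows`, `dense_predict`, and `classify` are hypothetical, and `classify` stands in for whatever model maps a window of video, audio, and text to a label.

```python
from typing import Callable, List, Tuple

Window = Tuple[float, float]  # (start, end) of a context window


def sliding_windows(duration: float,
                    delta_t: float = 10.0,
                    delta: float = 0.5) -> List[Window]:
    """Return context windows, one per prediction time.

    Assumption: delta_t is the window length and delta is the stride.
    Each window ends at a prediction time k * delta and looks back at
    most delta_t, clipped at the start of the recording.
    """
    if delta_t <= 0 or delta <= 0:
        raise ValueError("delta_t and delta must be positive")
    windows: List[Window] = []
    k = 1
    while k * delta <= duration:
        end = k * delta
        start = max(0.0, end - delta_t)
        windows.append((start, end))
        k += 1
    return windows


def dense_predict(duration: float,
                  classify: Callable[[float, float], str],
                  delta_t: float = 10.0,
                  delta: float = 0.5) -> List[Tuple[float, str]]:
    """Attach one label to every prediction time (dense classification).

    `classify` is a hypothetical placeholder for the multimodal model.
    """
    return [(end, classify(start, end))
            for start, end in sliding_windows(duration, delta_t, delta)]


if __name__ == "__main__":
    # Dummy classifier that always predicts "silence".
    for t, label in dense_predict(2.0, lambda start, end: "silence"):
        print(t, label)
```

Under these assumptions, a 12-unit recording yields 24 prediction times, and the final window spans the last 10 units. Emitting a label at every step, including `silence`, is what makes the classification dense rather than event-triggered.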
Abstract
While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, which lacks the rich contextual cues of real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a $4\times$ improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.
