Beyond Words: Multimodal LLM Knows When to Speak

Zikai Liao, Yi Ouyang, Yi-Lun Lee, Chen-Ping Yu, Yi-Hsuan Tsai, Zhaozheng Yin

TL;DR

The paper tackles the challenge of predicting when a conversational agent should speak, specifically focusing on brief, mid-turn reactions. It introduces MM-When2Speak, a multimodal LLM-based system that fuses video, audio, and text through an Encoder-Adaptor-LLM architecture and a sliding-window dense classification framework, with $\Delta t=10$ and $\delta=0.5$ to generate time-aligned predictions. Two curated datasets—Short-Clips and Full-Videos—enable fine-grained evaluation of reaction timing and turn-taking in dyadic conversations, with labels including seven reactions plus full\_response and silence. Empirical results show that multimodal fusion and self-attention fusion outperform state-of-the-art unimodal and LLM baselines, achieving up to $4\times$ improvements in speech-timing accuracy and robust gains across reaction types, though full-video performance can be affected by class imbalance. The work demonstrates the practical value of grounding speak-time decisions in multimodal cues for more natural, engaging conversational AI, while outlining limitations related to multi-party generalization and annotation noise and proposing extensions to model both when to speak and what to say.
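
The sliding-window dense classification idea can be illustrated with a minimal sketch. The summary gives only the values $\Delta t=10$ and $\delta=0.5$; treating $\Delta t$ as the window length and $\delta$ as the stride, both in seconds, is an assumption made here, as is the choice to end each window at the sampled timestamp. The function names and the stand-in classifier are hypothetical and are not the authors' implementation.

```python
from typing import Callable, List, Tuple

Window = Tuple[float, float]  # (start, end), assumed to be in seconds


def sliding_windows(duration: float, window: float, stride: float) -> List[Window]:
    """Return (start, end) spans, one per sampled timestamp.

    Assumption: each span covers the `window` seconds that end at the
    sampled timestamp, and timestamps advance by `stride`.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    spans: List[Window] = []
    k = 0
    while True:
        end = window + k * stride
        if end > duration + 1e-9:  # small tolerance for float rounding
            break
        spans.append((end - window, end))
        k += 1
    return spans


def dense_predict(
    duration: float,
    window: float,
    stride: float,
    classify: Callable[[float, float], str],
) -> List[Tuple[float, str]]:
    """Label every sampled timestamp with the classifier's output."""
    return [
        (end, classify(start, end))
        for start, end in sliding_windows(duration, window, stride)
    ]


def toy_classifier(start: float, end: float) -> str:
    """Hypothetical stand-in for the model: silent until t = 11, then reacts."""
    return "silence" if end < 11.0 else "affirmation"


if __name__ == "__main__":
    for timestamp, label in dense_predict(12.0, 10.0, 0.5, toy_classifier):
        print(f"t={timestamp:.1f}s -> {label}")
```

In a real system the classifier would consume the video, audio, and text falling inside each span; here it only sees the span boundaries, which is enough to show how a timing problem becomes a per-timestamp classification problem.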

Abstract

While large language model (LLM)-based chatbots have demonstrated strong capabilities in generating coherent and contextually relevant responses, they often struggle with understanding when to speak, particularly in delivering brief, timely reactions during ongoing conversations. This limitation arises largely from their reliance on text input, lacking the rich contextual cues in real-world human dialogue. In this work, we focus on real-time prediction of response types, with an emphasis on short, reactive utterances that depend on subtle, multimodal signals across vision, audio, and text. To support this, we introduce a new multimodal dataset constructed from real-world conversational videos, containing temporally aligned visual, auditory, and textual streams. This dataset enables fine-grained modeling of response timing in dyadic interactions. Building on this dataset, we propose MM-When2Speak, a multimodal LLM-based model that adaptively integrates visual, auditory, and textual context to predict when a response should occur, and what type of response is appropriate. Experiments show that MM-When2Speak significantly outperforms state-of-the-art unimodal and LLM-based baselines, achieving up to a 4x improvement in response timing accuracy over leading commercial LLMs. These results underscore the importance of multimodal inputs for producing timely, natural, and engaging conversational AI.

Paper Structure

This paper contains 14 sections, 3 figures, and 3 tables.

Figures (3)

  • Figure 1: Overview of our design. For a multimodal input with video, audio, and text, MM-When2Speak uses a sliding window to densely sample short clips for speak-time prediction, transforming the "when to speak" problem into a classification task. At each sampled timestamp, MM-When2Speak outputs a label indicating whether to stay silent, give a short reaction (e.g., Affirmation), or start speaking. This simple design enables accurate and prompt speak-time prediction in real-world conversations.
  • Figure 2: Architecture overview of MM-When2Speak. It encodes video frames, spectrogram features, and tokenized text for multimodal perception, and adaptively combines the attended information so that the LLM can make an accurate speak-time prediction.
  • Figure 3: Confusion matrices for MM-When2Speak. Each row compares different modalities, while each column compares the model with or without self-attention. Values are row-normalized.
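
The row normalization mentioned for Figure 3 can be sketched as follows. This is a minimal illustration, assuming the common convention that rows are true labels and columns are predicted labels (the caption does not state the orientation). The counts are made up for the example and are not results from the paper.

```python
from typing import List


def row_normalize(counts: List[List[int]]) -> List[List[float]]:
    """Divide each row by its sum so every non-empty row sums to 1.

    Rows with no samples are returned as all zeros to avoid dividing by zero.
    """
    normalized: List[List[float]] = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


if __name__ == "__main__":
    # Made-up counts for two labels, e.g. "silence" and "affirmation".
    toy_counts = [
        [8, 2],  # true label 0: 8 predicted correctly, 2 predicted as label 1
        [1, 3],  # true label 1: 1 predicted as label 0, 3 predicted correctly
    ]
    for row in row_normalize(toy_counts):
        print(row)
```

Under that row convention, the diagonal of the normalized matrix is the per-class recall, which keeps rare reaction classes readable even when one class such as silence dominates the raw counts.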