Table of Contents
Fetching ...

MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning

Yueqian Wang, Songxiang Liu, Disong Wang, Nuo Xu, Guanglu Wan, Huishuai Zhang, Dongyan Zhao

TL;DR

Proactive interaction in video MLLMs addresses the timing gap of turn-based systems by enabling autonomous reply decisions during video playback. The authors introduce MMDuet2, a text-to-text proactive framework trained with supervised fine-tuning and reinforcement learning using a PAUC-based reward to encourage early and accurate responses without precise timestamps, supported by a 52k-video proactive dialogue dataset. They demonstrate state-of-the-art performance on ProactiveVideoQA and competitive results on related benchmarks while preserving offline video understanding, and provide insights into reward design and frame-density effects. The work advances real-time, interactive video understanding and suggests directions for broader data collection, efficiency improvements, and multi-modal extension.

Abstract

Recent advances in video multimodal large language models (Video MLLMs) have significantly enhanced video understanding and multi-modal interaction capabilities. While most existing systems operate in a turn-based manner where the model can only reply after user turns, proactively deciding when to reply during video playback presents a promising yet challenging direction for real-time applications. In this work, we propose a novel text-to-text approach to proactive interaction, where the model autonomously determines whether to respond or remain silent at each turn based on dialogue history and visual context up to current frame of an streaming video. To overcome difficulties in previous methods such as manually tuning response decision thresholds and annotating precise reply times, we introduce a multi-turn RL based training method that encourages timely and accurate responses without requiring precise response time annotations. We train our model MMDuet2 on a dataset of 52k videos with two types of dialogues via SFT and RL. Experimental results demonstrate that MMDuet2 outperforms existing proactive Video MLLM baselines in response timing and quality, achieving state-of-the-art performance on the ProactiveVideoQA benchmark.

MMDuet2: Enhancing Proactive Interaction of Video MLLMs with Multi-Turn Reinforcement Learning

TL;DR

Proactive interaction in video MLLMs addresses the timing gap of turn-based systems by enabling autonomous reply decisions during video playback. The authors introduce MMDuet2, a text-to-text proactive framework trained with supervised fine-tuning and reinforcement learning using a PAUC-based reward to encourage early and accurate responses without precise timestamps, supported by a 52k-video proactive dialogue dataset. They demonstrate state-of-the-art performance on ProactiveVideoQA and competitive results on related benchmarks while preserving offline video understanding, and provide insights into reward design and frame-density effects. The work advances real-time, interactive video understanding and suggests directions for broader data collection, efficiency improvements, and multi-modal extension.

Abstract

Recent advances in video multimodal large language models (Video MLLMs) have significantly enhanced video understanding and multi-modal interaction capabilities. While most existing systems operate in a turn-based manner where the model can only reply after user turns, proactively deciding when to reply during video playback presents a promising yet challenging direction for real-time applications. In this work, we propose a novel text-to-text approach to proactive interaction, where the model autonomously determines whether to respond or remain silent at each turn based on dialogue history and visual context up to current frame of an streaming video. To overcome difficulties in previous methods such as manually tuning response decision thresholds and annotating precise reply times, we introduce a multi-turn RL based training method that encourages timely and accurate responses without requiring precise response time annotations. We train our model MMDuet2 on a dataset of 52k videos with two types of dialogues via SFT and RL. Experimental results demonstrate that MMDuet2 outperforms existing proactive Video MLLM baselines in response timing and quality, achieving state-of-the-art performance on the ProactiveVideoQA benchmark.

Paper Structure

This paper contains 22 sections, 2 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: A conceptual demonstration of the proactive dialogues in the proposed dataset.
  • Figure 2: Chat template of MMDuet2. User turns are marked in orange, assistant turns are marked in blue, and the textual contents of the dialogue between the two roles are underlined for the convenience of reading.
  • Figure 3: An example of a typical video snippet in dataset processing. Video frames circled by the green polygon constitutes a video scene.
  • Figure 4: An illustration of the calculation of the PAUC metric Wang2025ProactiveVideoQAAC.
  • Figure 5: Dynamics of key metrics of model behavior during RL training.