VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format

Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Jiansheng Wei, Huishuai Zhang, Dongyan Zhao

TL;DR

This work introduces the video-text duet interaction format to overcome the timeliness and grounding limitations of traditional whole-video VideoLLMs. It presents MMDuet, a three-component model augmented with informative and relevance heads, trained on the MMDuetIT dataset to learn when and where to respond during streaming video. The authors also propose MAGQA and construct a diverse training suite that enables dense captioning, temporal grounding, and multi-answer grounded QA in real time. Empirically, MMDuet improves performance on time-sensitive tasks such as dense captioning, highlight detection, temporal grounding, and MAGQA, while enabling real-time replies during video playback, demonstrating practical impact for live-streaming and surveillance applications.

Abstract

Recent research on video large language models (VideoLLMs) predominantly focuses on model architectures and training datasets, leaving the interaction format between the user and the model under-explored. In existing works, users often interact with VideoLLMs by providing the entire video and a query as input, after which the model generates a response. This interaction format constrains the application of VideoLLMs in scenarios such as live-streaming comprehension, where videos do not end and responses are required in real time, and also results in unsatisfactory performance on time-sensitive tasks that require localizing video segments. In this paper, we focus on a video-text duet interaction format. This interaction format is characterized by the continuous playback of the video, and both the user and the model can insert their text messages at any position during the video playback. When a text message ends, the video continues to play, akin to the alternation of two performers in a duet. We construct MMDuetIT, a video-text training dataset designed to adapt VideoLLMs to the video-text duet interaction format. We also introduce the Multi-Answer Grounded Video Question Answering (MAGQA) task to benchmark the real-time response ability of VideoLLMs. Trained on MMDuetIT, MMDuet demonstrates that adopting the video-text duet interaction format enables the model to achieve significant improvements in various time-sensitive tasks (76% CIDEr on YouCook2 dense video captioning, 90% mAP on QVHighlights highlight detection and 25% R@0.5 on Charades-STA temporal video grounding) with minimal training effort, and also enables VideoLLMs to reply in real time as the video plays.
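
The interleaving described above can be illustrated with a minimal sketch. It is not MMDuet's implementation: `Event`, `duet_stream`, `score_frame`, `respond`, and the fixed `threshold` are hypothetical names and assumptions introduced only to show how frames and text turns might alternate in playback order, with a per-frame score deciding when the model speaks.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Event:
    time: float    # seconds into playback
    role: str      # "frame", "user", or "assistant"
    content: str   # frame identifier or message text


def duet_stream(
    frame_times: List[float],
    user_msgs: Dict[float, str],
    score_frame: Callable[[float], float],   # hypothetical per-frame scorer
    respond: Callable[[float], str],         # hypothetical text generator
    threshold: float = 0.5,                  # assumed decision rule
) -> List[Event]:
    """Build one interleaved sequence of frames and text turns."""
    events: List[Event] = []
    pending = sorted(user_msgs.items())
    for t in frame_times:
        # Insert user messages whose timestamp has been reached.
        while pending and pending[0][0] <= t:
            user_time, text = pending.pop(0)
            events.append(Event(user_time, "user", text))
        events.append(Event(t, "frame", f"frame@{t}"))
        # The model inserts a reply only when the score passes the threshold;
        # playback then simply continues with the next frame.
        if score_frame(t) >= threshold:
            events.append(Event(t, "assistant", respond(t)))
    return events


events = duet_stream(
    frame_times=[0.0, 1.0, 2.0, 3.0],
    user_msgs={0.5: "What is happening?"},
    score_frame=lambda t: 0.9 if t == 2.0 else 0.1,
    respond=lambda t: f"reply at {t}",
)
roles = [e.role for e in events]
```

In this toy run the user message lands between the first two frames and the assistant replies once, right after the frame at 2.0 seconds, after which the video keeps playing.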

Paper Structure

This paper contains 38 sections, 11 figures, 7 tables.

Figures (11)

  • Figure 1: An example of the common Whole Video Interaction Format and our Video-Text Duet Interaction Format.
  • Figure 2: Example of reformatting the annotation of a video segment to video-text duet interaction format in MMDuetIT. Information from the original annotation is emphasized with underlines.
  • Figure 3: Data Distribution of MMDuetIT.
  • Figure 4: Performance on temporal video grounding and highlight detection with different $w$.
  • Figure 5: Performance on dense video captioning with different $s$.
  • ...and 6 more figures