OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts

Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng

TL;DR

OmniMMI introduces the first comprehensive benchmark for evaluating OmniLLMs in streaming video contexts, focusing on streaming understanding and proactive reasoning across six subtasks. It is accompanied by a novel inference framework, M4, which lets a model see and listen while generating by means of multiplexed inputs, highlight-based KV caching, and parallel decoding, and by a video-free synthetic tuning set, M4-IT, which adds proactive capabilities without extra video training. The dataset comprises 1,121 videos and 2,290 questions, with multi-turn prompts that emulate real-time interactive scenarios; it reveals limitations of current models in multi-turn streaming tasks and audio-visual alignment. The work demonstrates that long-context models and carefully designed interleaved instruction data can improve certain interactive tasks, underscoring the need for efficient streaming designs and better modality integration for open-world multimodal agents.

Abstract

The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses 1,121 videos and 2,290 questions, addressing two critical yet underexplored challenges in existing video benchmarks: streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.

Paper Structure

This paper contains 46 sections, 1 equation, 7 figures, 10 tables, 1 algorithm.

Figures (7)

  • Figure 1: OmniMMI consists of two categories of multi-modal interactive challenges: streaming video understanding (top) and proactive reasoning (bottom). Each query is processed into natural language text and synthetic audio as input.
  • Figure 2: Distribution and examples of different types of query prompts.
  • Figure 3: Distribution of video durations.
  • Figure 4: Multiplexing Modeling of M4. $v$ is the streaming video, $q_i$ denotes the input query, $t_i$ indicates the generated token, and $n_i$ denotes a noise token, which is discarded from the KVCache. The streaming video KVCache is computed to trigger a highlight spot index for the next response generation. Proactive interruption is facilitated through the computation of specific tokens designed for noise and stop signals. Parallel decoding uses a masking strategy with a dynamic KVCache to process multiple queries in one forward step.
  • Figure 5: Attention feature map, computed with the query as Q and the frames as K. The query consists of the last three tokens of the text query, while each key is a mean-pooled frame.
  • ...and 2 more figures
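
The attention map described in the Figure 5 caption can be illustrated with a minimal sketch. Only two details come from the caption: the query is the last three tokens of the text query, and each key is a mean-pooled frame. Everything else here is an assumption made for illustration and is not taken from the paper: the function names (`mean_pool`, `frame_attention`, `highlight_index`), the scaled dot-product scoring, the softmax over frames, and the choice of the highest average-attention frame as the highlight spot.

```python
import math


def mean_pool(frame):
    """Average a frame's patch vectors into a single key vector."""
    dim = len(frame[0])
    return [sum(patch[d] for patch in frame) / len(frame) for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def frame_attention(query_tokens, frames, num_query_tokens=3):
    """Return one attention row per query token, each a distribution over frames.

    Assumption: scaled dot-product scores followed by a softmax over frames.
    """
    queries = query_tokens[-num_query_tokens:]   # last three text-query tokens
    keys = [mean_pool(frame) for frame in frames]  # one key per frame
    scale = math.sqrt(len(keys[0]))
    rows = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        rows.append(softmax(scores))
    return rows


def highlight_index(rows):
    """Hypothetical selection rule: frame with the highest mean attention."""
    num_frames = len(rows[0])
    avg = [sum(row[i] for row in rows) / len(rows) for i in range(num_frames)]
    return max(range(num_frames), key=avg.__getitem__)
```

With toy inputs such as four 2-dimensional query tokens and three frames of two patches each, `frame_attention` yields three rows (one per retained query token), and `highlight_index` picks the frame whose pooled key aligns best with those tokens.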