OmniMMI: A Comprehensive Multi-modal Interaction Benchmark in Streaming Video Contexts
Yuxuan Wang, Yueqian Wang, Bo Chen, Tong Wu, Dongyan Zhao, Zilong Zheng
TL;DR
OmniMMI introduces the first comprehensive benchmark for evaluating OmniLLMs in streaming video contexts, focusing on streaming understanding and proactive reasoning across six subtasks. It is accompanied by a novel inference framework, M4, which lets a model see and listen while generating via multiplexed inputs, highlight-based KV caching, and parallel decoding, and by M4-IT, a video-free synthetic tuning set that facilitates proactive capabilities without extra video training. The dataset comprises 1,121 videos and 2,290 questions, with multi-turn prompts that emulate real-time interactive scenarios. Evaluation on it reveals limitations of current models in multi-turn streaming tasks and audio-visual alignment. The work demonstrates that long-context models and carefully designed interleaved instruction data can improve certain interactive tasks, underscoring the need for efficient streaming designs and better modality integration for open-world multimodal agents.
Abstract
The rapid advancement of multi-modal language models (MLLMs) like GPT-4o has propelled the development of Omni language models, designed to process and proactively respond to continuous streams of multi-modal data. Despite their potential, evaluating their real-world interactive capabilities in streaming video contexts remains a formidable challenge. In this work, we introduce OmniMMI, a comprehensive multi-modal interaction benchmark tailored for OmniLLMs in streaming video contexts. OmniMMI encompasses 1,121 videos and 2,290 questions and addresses two critical yet underexplored challenges in existing video benchmarks, streaming video understanding and proactive reasoning, across six distinct subtasks. Moreover, we propose a novel framework, Multi-modal Multiplexing Modeling (M4), designed to enable an inference-efficient streaming model that can see and listen while generating.
