Online Video Understanding: OVBench and VideoChat-Online

Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, Limin Wang

TL;DR

The paper tackles the challenge of real-time online video understanding by introducing OVBench, a benchmark tailored for streaming spatiotemporal reasoning, along with a Pyramid Memory Bank that balances spatial detail and temporal continuity. It couples this with an offline-to-online training paradigm to fuse offline video data with live streaming data, yielding VideoChat-Online, a 4B-parameter model that achieves state-of-the-art results on OVBench and strong performance on existing offline benchmarks. Key contributions include the PMB memory architecture, an interleaved dialogue-style data format for online training, and comprehensive ablations demonstrating the benefits of memory design, updating strategies, and progressive training. Together, these advances enable efficient, real-time, multimodal video understanding suitable for real-world applications like autonomous driving and human-computer interaction, while preserving strong generalization to offline tasks.

Abstract

Multimodal Large Language Models (MLLMs) have made significant progress in offline video understanding. However, applying these models to real-world scenarios, such as autonomous driving and human-computer interaction, presents unique challenges because continuous online video streams must be processed in real time. To this end, this paper presents systematic efforts from three perspectives: evaluation benchmark, model architecture, and training strategy. First, we introduce OVBench, a comprehensive question-answering benchmark designed to evaluate models' ability to perceive, memorize, and reason within online video contexts. It features 6 core task types across three temporal contexts (past, current, and future), forming 16 subtasks from diverse datasets. Second, we propose a new Pyramid Memory Bank (PMB) that effectively retains key spatiotemporal information in video streams. Third, we propose an offline-to-online learning paradigm, designing an interleaved dialogue format for online video data and constructing an instruction-tuning dataset tailored for online video training. This framework led to the development of VideoChat-Online, a robust and efficient model for online video understanding. Despite its lower computational cost and higher efficiency, VideoChat-Online outperforms existing state-of-the-art offline and online models across popular offline video benchmarks and OVBench, demonstrating the effectiveness of our model architecture and training strategy. Our approach surpasses the existing state-of-the-art offline model Qwen2-VL 7B and the online model Flash-VStream by 4.19% and 23.7% on OVBench, respectively.
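
The interleaved dialogue format mentioned in the abstract can be pictured with a small sketch. This is not the authors' data pipeline: the `Turn` class, the `interleave` function, and the `("frames" | "user" | "assistant", payload)` tuples are hypothetical names chosen for illustration. The only idea taken from the paper is that queries are inserted along the video timeline, so each question follows the frames seen up to that moment.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    time: float    # second at which the user asks the question (assumed field)
    question: str
    answer: str


def interleave(frame_times, turns):
    """Merge frame timestamps and QA turns into one time-ordered sequence.

    Each question is placed after all frames whose timestamp is <= its
    query time, so the answer can only depend on video seen so far.
    """
    sequence = []
    frames = sorted(frame_times)
    cursor = 0
    for turn in sorted(turns, key=lambda t: t.time):
        chunk = []
        # Collect every frame that arrived before (or at) the query time.
        while cursor < len(frames) and frames[cursor] <= turn.time:
            chunk.append(frames[cursor])
            cursor += 1
        if chunk:
            sequence.append(("frames", chunk))
        sequence.append(("user", turn.question))
        sequence.append(("assistant", turn.answer))
    # Frames after the last question are kept as a trailing chunk.
    if cursor < len(frames):
        sequence.append(("frames", frames[cursor:]))
    return sequence
```

Calling `interleave` with frames at seconds 0 to 5 and two questions asked at 1.5 s and 4.0 s yields a frames/user/assistant pattern in which the second question sees frames 2 to 4 in addition to the earlier context.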

Paper Structure (28 sections, 3 equations, 14 figures, 16 tables)

Figures (14)

  • Figure 1: OVBench contains 6 core spatiotemporal understanding tasks in online scenarios, incorporating three primary temporal contexts—past, current, and future. Based on various interaction types, it is expanded into 16 subtasks in total.
  • Figure 2: Generation pipeline of OVBench. We developed a method to ensure annotation quality by building on existing high-quality spatiotemporal data. The pipeline covers task definition, data collection, QA construction, and multiple-choice question generation suitable for streaming video scenarios. The details are discussed in the OVBench section.
  • Figure 3: Pyramid Memory Bank architecture, illustrating the model's inference process with the pyramid memory bank structure. The $m_{main}$ queues maintain balanced spatiotemporal information at different hierarchical levels, $m_t$ is a high-frequency sampling queue that preserves temporal detail, and $m_s$ is a queue that retains spatial detail. The system supports simultaneous frame input to both the memory bank and the KVCache, with synchronization mechanisms that keep the two consistent when the memory is modified.
  • Figure 4: Data Format Conversion Process for Online Spatiotemporal Instruction-Finetuning. Our pipeline begins with 96K high-quality samples curated from 5 tasks across 12 datasets. The conversion process enhances online spatiotemporal understanding through template transformation. For each video sample, we strategically insert queries along the timeline in an organized interleaved format to facilitate temporal context differentiation.
  • Figure 5: Qualitative comparison on online data training
  • ...and 9 more figures
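
The queue layout described in the Figure 3 caption can be illustrated with a toy sketch. Everything concrete here is an assumption rather than the paper's configuration: the strides, pooling factors, capacities, the use of two `m_main` levels, the flat-list frame representation, and the oldest-first (FIFO) eviction are all hypothetical. Only the roles of the queues (`m_s` for spatial detail, `m_main` for balanced levels, `m_t` for dense temporal sampling) come from the caption.

```python
from collections import deque
from dataclasses import dataclass, field


def pool_tokens(tokens, factor):
    """Average-pool a flat token list by `factor` (stand-in for spatial downsampling)."""
    return [
        sum(tokens[i:i + factor]) / len(tokens[i:i + factor])
        for i in range(0, len(tokens), factor)
    ]


@dataclass
class Level:
    stride: int      # keep every `stride`-th frame (assumed sampling rule)
    pool: int        # pooling factor applied before storing
    capacity: int    # maximum number of frames kept
    queue: deque = field(init=False)

    def __post_init__(self):
        # deque(maxlen=...) drops the oldest entry when full (FIFO, an assumption).
        self.queue = deque(maxlen=self.capacity)

    def offer(self, index, tokens):
        if index % self.stride == 0:
            self.queue.append((index, pool_tokens(tokens, self.pool)))


class ToyPyramidMemory:
    def __init__(self):
        # Sparse but detailed at the top, dense but coarse at the bottom.
        self.levels = {
            "m_s": Level(stride=8, pool=1, capacity=2),
            "m_main_0": Level(stride=4, pool=2, capacity=4),
            "m_main_1": Level(stride=2, pool=4, capacity=4),
            "m_t": Level(stride=1, pool=8, capacity=8),
        }
        self.seen = 0

    def add_frame(self, tokens):
        for level in self.levels.values():
            level.offer(self.seen, tokens)
        self.seen += 1

    def frame_indices(self, name):
        return [index for index, _ in self.levels[name].queue]

    def token_count(self):
        return sum(len(t) for lvl in self.levels.values() for _, t in lvl.queue)
```

In this sketch the total number of stored tokens stays bounded no matter how long the stream runs, because each queue has a fixed capacity. A faithful implementation would also need the memory-update strategy and the KVCache synchronization described in the paper, which are not modeled here.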