OVO-Bench: How Far is Your Video-LLMs from Real-World Online Video Understanding?

Yifei Li; Junbo Niu; Ziyang Miao; Chunjiang Ge; Yuanhang Zhou; Qihao He; Xiaoyi Dong; Haodong Duan; Shuangrui Ding; Rui Qian; Pan Zhang; Yuhang Zang; Yuhang Cao; Conghui He; Jiaqi Wang

TL;DR

OVO-Bench introduces a formal benchmark for online video understanding, emphasizing temporal awareness across three modes—Backward Tracing, Real-Time Visual Perception, and Forward Active Responding—and evaluating 12 tasks with 644 videos and ~2.8k meta-annotations. It combines automated QA generation with human curation to create precise, timestamped prompts, enabling systematic querying of Video-LLMs along the video timeline. Experimental results show a gap between current offline and online Video-LLMs, with offline models sometimes transferring better to online-like tasks but online models suffering from latency and hallucinations, especially in forward-responding scenarios. By providing a dedicated evaluation framework, dataset, and baseline analyses, OVO-Bench aims to drive progress toward practical, real-world online video understanding and reasoning in AI systems.

Abstract

Temporal Awareness, the ability to reason dynamically based on the timestamp when a question is raised, is the key distinction between offline and online video LLMs. Unlike offline models, which rely on complete videos for static, post hoc analysis, online models process video streams incrementally and dynamically adapt their responses based on the timestamp at which the question is posed. Despite its significance, temporal awareness has not been adequately evaluated in existing benchmarks. To fill this gap, we present OVO-Bench (Online-VideO-Benchmark), a novel video benchmark that emphasizes the importance of timestamps for advanced online video understanding capability benchmarking. OVO-Bench evaluates the ability of video LLMs to reason and respond to events occurring at specific timestamps under three distinct scenarios: (1) Backward tracing: trace back to past events to answer the question. (2) Real-time understanding: understand and respond to events as they unfold at the current timestamp. (3) Forward active responding: delay the response until sufficient future information becomes available to answer the question accurately. OVO-Bench comprises 12 tasks, featuring 644 unique videos and approximately 2,800 human-curated, fine-grained meta-annotations with precise timestamps. We combine automated generation pipelines with human curation. With these high-quality samples, we further developed an evaluation pipeline to systematically query video LLMs along the video timeline. Evaluations of nine Video-LLMs reveal that, despite advancements on traditional benchmarks, current models struggle with online video understanding, showing a significant gap compared to human agents. We hope OVO-Bench will drive progress in video LLMs and inspire future research in online video reasoning. Our benchmark and code can be accessed at https://github.com/JoeLeelyf/OVO-Bench.
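
As a rough illustration of what timestamp-conditioned querying could look like, the minimal sketch below restricts a model's input to frames streamed up to the moment a question is asked, and delays a forward-responding answer until a readiness check passes. All names here (`Query`, `visible_frames`, `forward_respond`, the `ready` callback) are hypothetical and are not taken from the OVO-Bench codebase; the actual evaluation pipeline may differ.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Query:
    question: str
    ask_time: float  # seconds into the stream at which the question is raised


def visible_frames(frame_times: List[float], now: float) -> List[float]:
    """Frames available at time `now`: only those already streamed."""
    return [t for t in frame_times if t <= now]


def forward_respond(
    frame_times: List[float],
    query: Query,
    ready: Callable[[List[float]], bool],
) -> Optional[float]:
    """Return the first timestamp at or after `ask_time` where `ready` accepts
    the frames seen so far, or None if the stream ends first.

    `ready` is a stand-in (assumption) for a model deciding that it has
    enough information to answer.
    """
    for now in frame_times:
        if now < query.ask_time:
            continue  # the question has not been asked yet
        if ready(visible_frames(frame_times, now)):
            return now
    return None


frame_times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
query = Query(question="How many times does the action repeat?", ask_time=2.0)

# Backward tracing and real-time perception both answer from this context.
context = visible_frames(frame_times, query.ask_time)

# Toy readiness rule: wait until at least five frames have been observed.
answer_time = forward_respond(frame_times, query, lambda seen: len(seen) >= 5)
```

In this toy setup the question is asked at 2.0 s, so only the first three frames are visible at that moment, and the forward-responding loop waits until 4.0 s before answering.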

Paper Structure (29 sections, 11 figures, 2 tables)

Introduction
Related Works
OVO-Bench
Online Video Understanding Mode Taxonomy
Backward Tracing
Real-Time Visual Perception
Forward Active Responding
Benchmark Construction
Video and Annotation Collection
Prompt Generation
Datasets Statistics
Experiments
Models and Evaluation Strategies
Main Results
Comparison between online Video-LLMs and offline Video-LLMs
...and 14 more sections

Figures (11)

Figure 1: A demonstrative comparison between offline and online video understanding [videollm-online]. Offline video understanding focuses on answering questions based on the entirety of a video. In contrast, online video understanding involves posing queries about the context of a video at intermediate points, demanding the ability to trace back past information, perceive ongoing events, and adapt to continuous input.
Figure 2: Examples of each task in OVO-Bench. The 12 tasks are categorized into three different kinds of perceiving modes in online video understanding: Backward Tracing, Real-Time Visual Perception, and Forward Active Responding.
Figure 3: Generation pipeline of OVO-Bench. Within public annotations, data is carefully filtered and relevant multiple-choice QAs are auto-generated. The effective system prompt and efficient answer prompt are employed to guide MLLMs toward precise outputs. The Video-LLMs we use to annotate videos are GPT-4o and Gemini-1.5 Pro.
Figure 4: Left: Queries Temporal Distribution in OVO-Bench. Center: Linguistic Characteristics of Text Queries. Right: Video category distribution of OVO-Bench.
Figure 5: Performance comparison between online Video-LLMs and offline Video-LLMs. The figure illustrates the average scores of different models on the OVO-Bench in real-time visual perception tasks.
...and 6 more figures
