Seeing the Scene Matters: Revealing Forgetting in Video Understanding Models with a Scene-Aware Long-Video Benchmark

Seng Nam Chen, Hao Chen, Chenglam Ho, Xinyu Mao, Jinping Wang, Yu Zhang, Chao Li

Abstract

Long video understanding (LVU) remains a core challenge in multimodal learning. Although recent vision-language models (VLMs) have made notable progress, existing benchmarks mainly focus on either fine-grained perception or coarse summarization, offering limited insight into temporal understanding over long contexts. In this work, we define a scene as a coherent segment of a video in which both visual and semantic contexts remain consistent, aligning with human perception. This leads us to a key question: can current VLMs reason effectively over long, scene-level contexts? To answer this, we introduce a new benchmark, SceneBench, designed to provide scene-level challenges. Our evaluation reveals a sharp drop in accuracy when VLMs attempt to answer scene-level questions, indicating significant forgetting of long-range context. To further validate these findings, we propose Scene Retrieval-Augmented Generation (Scene-RAG), which constructs a dynamic scene memory by retrieving and integrating relevant context across scenes. Scene-RAG improves VLM performance by +2.50%, confirming that current models still struggle with long-context retention. We hope SceneBench will encourage future research toward VLMs with more robust, human-like video comprehension.
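As a rough illustration of the retrieval step described above, the following is a minimal sketch of scene-level retrieval-augmented generation. The Scene and SceneMemory classes, the hash-based embed placeholder, and the prompt-assembly helper are illustrative assumptions introduced for exposition; they are not the paper's Scene-RAG implementation.

```python
# Minimal sketch of scene-level retrieval-augmented generation (Scene-RAG style).
# The embedding function and prompt assembly are placeholders/assumptions; only
# the build-memory -> retrieve -> answer flow is illustrated.
from dataclasses import dataclass
import numpy as np

@dataclass
class Scene:
    scene_id: int
    start_s: float   # scene start time in seconds
    end_s: float     # scene end time in seconds
    summary: str     # scene-level caption / transcript snippet

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Placeholder text embedding (hash-seeded); swap in a real encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

class SceneMemory:
    """Stores one embedding per scene and retrieves the top-k relevant scenes."""
    def __init__(self, scenes: list[Scene]):
        self.scenes = scenes
        self.vectors = np.stack([embed(s.summary) for s in scenes])

    def retrieve(self, query: str, k: int = 3) -> list[Scene]:
        q = embed(query)
        scores = self.vectors @ q          # cosine similarity (unit vectors)
        top = np.argsort(-scores)[:k]
        return [self.scenes[i] for i in top]

def build_prompt(question: str, retrieved: list[Scene]) -> str:
    """Concatenate retrieved scene evidence with the question for a VLM."""
    evidence = "\n".join(
        f"[Scene {s.scene_id}, {s.start_s:.0f}-{s.end_s:.0f}s] {s.summary}"
        for s in retrieved
    )
    return f"Context from relevant scenes:\n{evidence}\n\nQuestion: {question}"

if __name__ == "__main__":
    scenes = [
        Scene(0, 0, 140, "A chef preps vegetables in a kitchen."),
        Scene(1, 140, 360, "The chef plates the dish and serves guests."),
        Scene(2, 360, 600, "Guests discuss the meal at the table."),
    ]
    memory = SceneMemory(scenes)
    question = "What happens after the dish is served?"
    print(build_prompt(question, memory.retrieve(question, k=2)))
```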

Paper Structure

This paper contains 23 sections, 1 equation, 8 figures, 8 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Longer SceneQA distances lead to lower accuracy. Video-RAG [luo2024video] improves the baseline in short- and long-range settings but struggles in the mid-range, while our Scene-RAG achieves consistent gains, especially for mid- and long-term reasoning. Curves are smoothed using a rolling mean. Results use Qwen2.5-VL [qwen2.5-VL]. Solid markers denote actual measurements; curves are cubic-spline interpolations for visual clarity.
  • Figure 2: We divide video information into frame, clip, scene, and video levels. Frame-level information mainly describes details of the subject within a single frame. Clip-level information adds temporal context but captures only the objective behavior of objects. Scene-level information aggregates many clips into a complete scenario event, while the video level comprises many scenes, with related scenes forming a logical, cause-and-effect storyline. For ease of quantification, we use a two-minute threshold to distinguish the clip and scene levels.
  • Figure 3: Statistical overview of our SceneBench benchmark. (A) Distribution of scene lengths. (B) Distribution of task counts. (C) Distribution of SceneQA and SceneQA-Audio durations as a proportion of the full video.
  • Figure 4: Comparison of Scene-RAG with traditional RAG methods on video understanding tasks. Scene-RAG first aggregates long-range visual scenes via scene tiling, stores aligned visual–audio evidence, and retrieves task-relevant segments conditioned on user queries.
  • Figure 5: QA Accuracy Across Frame Lengths. Performance of SceneQA and SceneQA-Audio across different input frame lengths (16, 32, 64, and 128 frames). While SceneQA shows slight improvement with longer inputs, SceneQA-Audio performance peaks at moderate frame lengths and slightly declines for longer sequences.
  • ...and 3 more figures