Continuous Perception Matters: Diagnosing Temporal Integration Failures in Multimodal Models

Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy

TL;DR

Human vision naturally aggregates information over time to form stable scene representations, but current multimodal benchmarks largely evaluate fragmentary, frame-by-frame understanding. CP-Bench is a deliberately simple diagnostic for continuous perception: a model must count visually identical cubes seen through a moving camera, a design that eliminates texture-based shortcuts. Across open-source and proprietary systems, the results show a pervasive failure to accumulate evidence across time: models perform well on the static-camera control, but temporal integration remains poor and does not generalize. The work argues that achieving robust continuous perception will require new architectures and training paradigms that explicitly model persistent spatiotemporal representations.

Abstract

Continuous perception, the ability to integrate visual observations over time as a continuous stream, is essential for robust real-world understanding, yet remains largely untested in current multimodal models. We introduce CP-Bench, a minimal and fully controlled benchmark designed to isolate this capability using an extremely simple task: counting identical cubes in a synthetic scene while the camera moves and reveals only a subset of the objects at any moment. Despite the simplicity of the setting, we find that state-of-the-art open-source and commercial models, including Qwen-3-VL, InternVL3, GPT-5, and Gemini-3-Pro, fail dramatically. A static-camera control variant confirms that the failure arises not from object recognition but from an inability to accumulate evidence across time. Further experiments show that higher sampling FPS, perception- or spatial-enhanced models, and finetuning with additional videos all fail to produce meaningful cross-temporal generalization. Our results reveal a fundamental limitation in modern multimodal architectures and training paradigms. CP-Bench provides a simple yet powerful diagnostic tool and establishes a clean testbed for developing models capable of genuine time-consistent visual reasoning.
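
To make the counting task concrete, the following toy sketch illustrates why per-frame counting cannot solve it. This is not the CP-Bench generator: the 1-D scene, the evenly spaced cube positions, the field-of-view width, and all function names are assumptions chosen purely for illustration. The last function uses ground-truth cube indices as a stand-in for the cross-frame correspondence that a model would have to infer on its own, since the cubes are visually identical.

```python
def simulate_pan(num_cubes=8, scene_width=20.0, fov_width=5.0, num_frames=16):
    """Toy 1-D pan: return (true_count, list of per-frame visible cube indices).

    Hypothetical setup: cubes are evenly spaced points on a line, and the
    camera window [left, left + fov_width] slides from the left edge of the
    scene to the right edge, so only a subset of cubes is visible per frame.
    """
    spacing = scene_width / num_cubes
    cubes = [(k + 0.5) * spacing for k in range(num_cubes)]
    frames = []
    for i in range(num_frames):
        left = (scene_width - fov_width) * i / (num_frames - 1)
        visible = {idx for idx, x in enumerate(cubes)
                   if left <= x <= left + fov_width}
        frames.append(visible)
    return num_cubes, frames


def max_visible(frames):
    """Frame-by-frame shortcut: report the largest single-frame count."""
    return max(len(v) for v in frames)


def summed(frames):
    """Naive accumulation without correspondence: add up every frame's count."""
    return sum(len(v) for v in frames)


def integrated(frames):
    """Accumulation with correspondence: count each distinct cube once.

    Uses ground-truth indices (an oracle), which a real model does not have.
    """
    seen = set()
    for v in frames:
        seen |= v
    return len(seen)


true_count, frames = simulate_pan()
print("true count:          ", true_count)
print("max per-frame count: ", max_visible(frames))
print("sum of frame counts: ", summed(frames))
print("integrated count:    ", integrated(frames))
```

In this toy setting the largest single-frame count underestimates the total and the summed count overestimates it; only the variant that tracks which cubes have already been seen recovers the true number.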

Paper Structure (21 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Human perception (top) operates on a continuous visual stream, which enables a holistic and deep understanding of spatio-temporal events. In contrast, dominant AI models (bottom) employ a separate-frames paradigm, processing sparse, discrete snapshots. We argue that this leads to a fragmented and superficial understanding that fails to capture true spatio-temporal continuity.
  • Figure 2: Illustration of the proposed Continuous Perception Benchmark (CP-Bench). In the Main Setting (top-left), a camera performs a continuous horizontal pan, with only a subset of visually identical cubes visible at any given moment. This necessitates continuous spatio-temporal correspondence for accurate counting. In the Control Setting (top-right), the camera is static, and all cubes are visible simultaneously, allowing for static-frame counting. The bottom panel provides visual examples of sampled frames from typical videos.
  • Figure 3: Prediction distribution for Gemini-3-Pro (left) and Qwen2.5VL-7B (right) on the Continuous Perception Benchmark (CP-Bench). Green bars indicate correct predictions, while red bars represent incorrect predictions. The X-axis shows the model's predicted count, grouped by the ground truth (GT) count for each set of instances.
  • Figure 4: Fine-tuning generalization experiments. Models trained on 5s videos achieve perfect accuracy on the 5s test set but fail to generalize to 10s videos (left, 80-point drop). Conversely, models trained on 10s videos perform well on the 10s test set but fail to generalize to 5s videos (right, 69-point drop). This demonstrates that fine-tuning leads to overfitting to specific video dynamics (e.g., duration, camera speed) rather than learning the generalizable skill of continuous perception.
  • Figure 5: Two qualitative examples illustrating cases in which all evaluated models produce incorrect answers.