Continuous Perception Matters: Diagnosing Temporal Integration Failures in Multimodal Models
Zeyu Wang, Zhenzhen Weng, Serena Yeung-Levy
TL;DR
Human vision naturally aggregates information over time to form stable scene representations, but current multimodal benchmarks largely evaluate fragmentary, frame-by-frame understanding. CP-Bench is a deliberately simple continuous-perception diagnostic: models must count identical cubes seen through a moving camera, a setup that eliminates texture-based shortcuts. Across open-source and proprietary systems, results show a pervasive failure to accumulate evidence across time: models perform well on static-camera controls, but temporal integration remains poor and does not generalize. The work argues that achieving robust continuous perception will require new architectures and training paradigms that explicitly model persistent spatiotemporal representations.
Abstract
Continuous perception, the ability to integrate visual observations over time as a continuous stream, is essential for robust real-world understanding, yet remains largely untested in current multimodal models. We introduce CP-Bench, a minimal and fully controlled benchmark designed to isolate this capability using an extremely simple task: counting identical cubes in a synthetic scene while the camera moves and reveals only a subset of the objects at any moment. Despite the simplicity of the setting, we find that state-of-the-art open-source and commercial models, including Qwen-3-VL, InternVL3, GPT-5, and Gemini-3-Pro, fail dramatically. A static-camera control variant confirms that the failure arises not from object recognition but from an inability to accumulate evidence across time. Further experiments show that none of the remedies we tried (higher frame-sampling rates, perception- or spatially enhanced models, and finetuning on additional videos) leads to meaningful cross-temporal generalization. Our results reveal a fundamental limitation in modern multimodal architectures and training paradigms. CP-Bench provides a simple yet powerful diagnostic tool and establishes a clean testbed for developing models capable of genuine time-consistent visual reasoning.
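To make the task concrete, the following toy sketch illustrates why counting under a moving camera requires accumulating evidence across frames. It is not the CP-Bench generator or evaluation code: the 1-D scene, the function names (`visible_ids`, `simulate`), and all parameters are assumptions chosen for illustration. In particular, the sketch uses ground-truth cube identities, which a model never sees; because the cubes are visually identical, a model would have to infer which detections are repeats from the camera motion itself.

```python
from typing import List, Set, Tuple


def visible_ids(cube_positions: List[float], cam_center: float, half_fov: float) -> Set[int]:
    """Return indices of cubes inside a hypothetical 1-D field of view."""
    return {i for i, x in enumerate(cube_positions) if abs(x - cam_center) <= half_fov}


def simulate(
    cube_positions: List[float], cam_centers: List[float], half_fov: float
) -> Tuple[List[int], int]:
    """Return per-frame visible counts and the true count of distinct cubes seen."""
    frames = [visible_ids(cube_positions, c, half_fov) for c in cam_centers]
    per_frame_counts = [len(f) for f in frames]
    seen: Set[int] = set()
    for f in frames:
        seen |= f  # accumulate identities across time (ground truth only)
    return per_frame_counts, len(seen)


# Six identical cubes on a line; the camera pans and sees two at a time.
cubes = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
camera_path = [0.5, 1.5, 2.5, 3.5, 4.5]
counts, total = simulate(cubes, camera_path, half_fov=1.0)

print("per-frame counts:", counts)
print("max over frames: ", max(counts))   # undercounts
print("sum over frames: ", sum(counts))   # overcounts (overlapping views)
print("distinct cubes:  ", total)
```

In this toy setting neither the maximum nor the sum of per-frame counts recovers the true total, so any strategy that treats frames independently is bound to fail; only a representation that persists across frames gives the right answer.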
