Cambrian-S: Towards Spatial Supersensing in Video

Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, Saining Xie

TL;DR

The paper argues that progress toward true multimodal intelligence requires spatial supersensing, a four-stage hierarchy comprising semantic perception, streaming event cognition, implicit 3D spatial cognition, and predictive world modeling. It introduces VSI-SUPER, a two-part long-horizon benchmark (VSR and VSC), and curates VSI-590K to train Cambrian-S, showing that data scale yields gains on existing benchmarks but does not solve continual spatial reasoning. It then proposes a predictive-sensing paradigm, demonstrated via a latent-frame-prediction head that uses prediction error to guide memory and event segmentation and outperforms strong baselines on VSI-SUPER. Overall, the work suggests a shift from scaling alone to developing internal world models that actively predict, select, and organize sensory experience for robust spatial supersensing.

Abstract

We argue that progress in true multimodal intelligence calls for a shift from reactive, task-driven systems and brute-force long context towards a broader paradigm of supersensing. We frame spatial supersensing as four stages beyond linguistic-only understanding: semantic perception (naming what is seen), streaming event cognition (maintaining memory across continuous experiences), implicit 3D spatial cognition (inferring the world behind pixels), and predictive world modeling (creating internal models that filter and organize information). Current benchmarks largely test only the early stages, offering narrow coverage of spatial cognition and rarely challenging models in ways that require true world modeling. To drive progress in spatial supersensing, we present VSI-SUPER, a two-part benchmark: VSR (long-horizon visual spatial recall) and VSC (continual visual spatial counting). These tasks require arbitrarily long video inputs yet are resistant to brute-force context expansion. We then test data scaling limits by curating VSI-590K and training Cambrian-S, achieving +30% absolute improvement on VSI-Bench without sacrificing general capabilities. Yet performance on VSI-SUPER remains limited, indicating that scale alone is insufficient for spatial supersensing. We propose predictive sensing as a path forward, presenting a proof-of-concept in which a self-supervised next-latent-frame predictor leverages surprise (prediction error) to drive memory and event segmentation. On VSI-SUPER, this approach substantially outperforms leading proprietary baselines, showing that spatial supersensing requires models that not only see but also anticipate, select, and organize experience.
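
The surprise-driven segmentation idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the cosine-distance surprise measure, the fixed threshold, and the "persistence" predictor (which simply repeats the last latent) are all assumptions chosen to keep the example self-contained. In the paper, the predictor is a learned, self-supervised next-latent-frame head.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally surprising
    return 1.0 - dot / (norm_a * norm_b)


def segment_by_surprise(latents, predict_next, threshold):
    """Mark an event boundary wherever prediction error exceeds threshold.

    latents      -- list of per-frame latent vectors (hypothetical inputs)
    predict_next -- callable mapping the history so far to a predicted latent
    threshold    -- hypothetical surprise cutoff; a real system would tune it
    """
    surprises, boundaries = [], []
    for t in range(1, len(latents)):
        predicted = predict_next(latents[:t])
        surprise = cosine_distance(predicted, latents[t])
        surprises.append(surprise)
        if surprise > threshold:
            boundaries.append(t)  # frame t starts a new event
    return surprises, boundaries


def persistence_predictor(history):
    """Stand-in predictor (assumption): the next latent equals the last one."""
    return history[-1]


# Toy latents: three similar frames, then an abrupt change of scene.
latents = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0]]
surprises, boundaries = segment_by_surprise(
    latents, persistence_predictor, threshold=0.5
)
print(boundaries)
```

In this toy input only the jump at frame 3 exceeds the threshold, so a single boundary is reported. The same surprise signal could also decide which frames to keep in memory, for example by storing low-surprise frames in compressed form.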

Paper Structure

This paper contains 3 sections, 2 figures.

Figures (2)

  • Figure 1: From pixels to predictive mind. We look beyond linguistic-only understanding to envision multimodal intelligence that sees, remembers, and reasons as part of a continuous, lived world. It begins with semantic perception: naming and describing what is seen. Streaming event cognition goes further, enabling always-on sensing across continuous input streams, integrating memory, and supporting proactive responses. Spatial cognition captures the implicit 3D structure of video, enabling reasoning about objects, configurations, and metrics. Finally, a predictive world model emerges, one that learns passively from experience, updates through prediction and surprise, and retains information for future use. Lower illustration: Video serves as the ideal experimental domain. Models must advance from frame-level Q&A to constructing implicit world models that enable deeper spatial reasoning, scale to unbounded horizons, and achieve supersensing that rivals, and ultimately surpasses, human visual intelligence.
  • Figure 2: Benchmark diagnostic results reveal varying dependence on visual input. We evaluate models under distinct input conditions: (a) multiple (32) uniformly sampled frames, (b) a single (middle) frame, and (c) frame captions, benchmarked against chance-level and blind test results (visual input ignored). Panels (a–c) show absolute accuracies; panels (d–j) show performance differences between conditions. Visual inputs are substantially more critical for VSI-Bench [yang2024think], Tomato [shangguan2024tomato], and HourVideo [chandrasegaran2024hourvideo], while their impact is less pronounced for VideoMME [fu2025video], MVBench [li2024mvbench], and VideoMMMU [hu2025video]. VSR and VSC are new supersensing benchmarks introduced in the VSI-SUPER benchmark section.
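
The condition-versus-baseline comparison described in the Figure 2 caption can be sketched as a small helper. The condition names, the particular pairs compared, and the accuracy values below are placeholders made up for illustration. They are not results from the paper, and the caption does not say which pairs the difference panels actually plot.

```python
def diagnostic_deltas(acc):
    """Compute accuracy differences between input conditions.

    acc maps a condition name to accuracy in percent. The keys used here
    ("multi_frame", "single_frame", "captions", "blind", "chance") are
    hypothetical labels for the conditions named in the caption.
    """
    return {
        "multi_vs_single": acc["multi_frame"] - acc["single_frame"],
        "multi_vs_captions": acc["multi_frame"] - acc["captions"],
        "multi_vs_blind": acc["multi_frame"] - acc["blind"],
        "multi_vs_chance": acc["multi_frame"] - acc["chance"],
        "blind_vs_chance": acc["blind"] - acc["chance"],
    }


# Placeholder accuracies for one imaginary benchmark (not real results).
example = {
    "multi_frame": 60.0,
    "single_frame": 45.0,
    "captions": 50.0,
    "blind": 35.0,
    "chance": 25.0,
}
deltas = diagnostic_deltas(example)
print(deltas["multi_vs_blind"])
```

A large gap between the multi-frame and blind conditions suggests the benchmark depends on visual input, while a large gap between blind and chance suggests that language priors alone answer many of its questions.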