Solving Spatial Supersensing Without Spatial Supersensing

Vishaal Udandarao, Shyamgopal Karthik, Surabhi S. Nath, Andreas Hochlehnert, Matthias Bethge, Ameya Prabhu

TL;DR

The paper challenges the claim that VSI-Super benchmarks require spatial supersensing by showing that simple retrieval-based baselines can solve VSR and that VSC counting relies on benchmark quirks. It introduces NoSense, a streaming, frame-level SigLIP/CLIP-based baseline for VSR, and a VSC-Repeat sanity test that disrupts the counting strategy, revealing reliance on shortcuts. The authors argue that current benchmarks co-adapt with inference pipelines and propose design principles such as invariance checks, longer continuous videos, and meta-evaluation to better assess genuine spatial supersensing. The work underscores the need for robust world-model evaluation in video understanding and points toward future benchmark designs that enforce revisits and long-horizon integration.

Abstract

Cambrian-S aims to take the first steps towards improving video world models with spatial supersensing by introducing (i) two benchmarks, VSI-Super-Recall (VSR) and VSI-Super-Counting (VSC), and (ii) bespoke predictive sensing inference strategies tailored to each benchmark. In this work, we conduct a critical analysis of Cambrian-S on both fronts. First, we introduce a simple baseline, NoSense, which discards almost all temporal structure and uses only a bag-of-words SigLIP model, yet near-perfectly solves VSR, achieving 95% accuracy even on 4-hour videos. This shows that benchmarks like VSR can be nearly solved without spatial cognition, world modeling, or spatial supersensing. Second, we hypothesize that the tailored inference methods proposed by Cambrian-S likely exploit shortcut heuristics in the benchmark. We illustrate this with a simple sanity check on the VSC benchmark, called VSC-Repeat: we concatenate each video with itself 1–5 times, which does not change the number of unique objects. However, this simple perturbation entirely collapses the mean relative accuracy of Cambrian-S from 42% to 0%. A system that performs spatial supersensing and integrates information across experiences should recognize views of the same scene and keep object-count predictions unchanged; instead, the Cambrian-S inference algorithm relies largely on a shortcut in the VSC benchmark, namely that rooms are never revisited. Taken together, our findings suggest that (i) current VSI-Super benchmarks do not yet reliably measure spatial supersensing, and (ii) the predictive-sensing inference recipes used by Cambrian-S improve performance by inadvertently exploiting shortcuts rather than through robust spatial supersensing. We include the response from the Cambrian-S authors (in Appendix A) to provide a balanced perspective alongside our claims. We release our code at: https://github.com/bethgelab/supersanity
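
The invariance that VSC-Repeat tests can be illustrated with a toy sketch. This is not the Cambrian-S inference code or the released implementation: the `Video` type, the oracle object IDs, and both counter functions are hypothetical stand-ins, and `copies` is assumed to mean the total number of back-to-back playthroughs. The sketch only contrasts a counter that treats every segment as a new room with one that keeps a persistent set of objects.

```python
from typing import List, Set

# Toy stand-in for a video: an ordered list of room visits, each holding the
# IDs of the objects visible in that room (hypothetical data, not from VSC).
Video = List[Set[str]]


def repeat_video(video: Video, copies: int) -> Video:
    """Play the same video `copies` times back to back."""
    return video * copies


def segment_sum_count(video: Video) -> int:
    """Shortcut: assume every segment is a new room and add up its objects."""
    return sum(len(room) for room in video)


def persistent_set_count(video: Video) -> int:
    """Revisit-aware: keep one global set of object IDs across all segments."""
    seen: Set[str] = set()
    for room in video:
        seen |= room
    return len(seen)


video: Video = [{"chair_1", "chair_2"}, {"chair_3"}]  # 3 unique objects
for copies in (1, 2, 6):
    clip = repeat_video(video, copies)
    # Columns: copies, shortcut count, revisit-aware count
    print(copies, segment_sum_count(clip), persistent_set_count(clip))
```

In this toy setting the shortcut count grows linearly with the number of copies while the set-based count stays at 3, which is the qualitative behavior the sanity check is designed to separate.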

Paper Structure

This paper contains 7 sections, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: (Left) NoSense solves VSR without supersensing. Our NoSense baseline uses only a SigLIP model with independent frame-level processing -- no video model, LLM, long-term memory, or temporal reasoning -- yet almost perfectly solves VSR, showing that the VSR benchmark can be solved without spatial supersensing. (Right) Cambrian-S exploits VSC-specific shortcuts. For the VSC benchmark, we repeat each 10-min video 1–5 times; a supersensing model should output the same object counts, since the unique object count stays the same. Instead, Cambrian-S' mean relative accuracy collapses from 42% to 0% after 5 repeats, indicating that its surprise-based segmentation inference method relies on VSC-specific shortcuts rather than genuine spatial cognition.
  • Figure 2: NoSense does no spatial supersensing. Frames are encoded independently with SigLIP. We keep the top-4 frames by cosine similarity to the object query and aggregate similarities to auxiliary objects from the MCQ options to select the answer. The pipeline is streaming and memory-efficient, and uses only the relative order of the four most object-relevant frames; it never reasons about motion, continuity, or long-range temporal patterns.
  • Figure 3: NoSense solves VSR with no supersensing. NoSense uses only a SigLIP image encoder with independent frame-level processing -- no video model, LLM, memory, or temporal reasoning. Yet NoSense nearly perfectly solves VSR (left), while using a fraction of the GPU memory of previous methods (right). This clearly shows that VSR can be solved without explicit 3D state, object tracking, or long-horizon temporal reasoning.
  • Figure 4: Repeating VSC videos exposes a counting shortcut. We propose a simple sanity check for VSC, called VSC-Repeat. We concatenate each VSC video (from the 10-min split) with itself 1–5 times. Since no new scene is introduced, the ground-truth number of unique objects remains unchanged. This sanity check helps test whether models indeed hold long-term global state or instead exploit simple segmentation-based shortcuts.
  • Figure 5: (Left) Cambrian-S collapses on VSC-Repeat, with mean relative accuracy falling from 42% to 0% after 5 repeats, indicating that its inference method relies on VSC-specific shortcuts rather than genuine spatial cognition. (Right) Cambrian-S overcounts objects almost exactly in proportion to the number of repeats: the predicted count is strongly correlated with the number of repeats, i.e., the method counts the objects in each repeated room as new rather than maintaining a persistent set of unique objects across rooms. A supersensing model should output the same object counts across VSC-Repeat, since the unique object count stays the same.
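
The frame-selection idea described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: embeddings are plain Python lists standing in for SigLIP image and text features, the helper names (`cosine`, `top_k_frames`, `pick_option`) are hypothetical, and scoring each MCQ option by summing per-position similarities is one plausible reading of "aggregate similarities to auxiliary objects".

```python
import heapq
import math
from typing import Dict, Iterable, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_frames(frames: Iterable[Vector], query: Vector,
                 k: int = 4) -> List[Tuple[int, Vector]]:
    """Stream over frame embeddings, keeping only the k most query-similar."""
    heap: List[Tuple[float, int, Vector]] = []  # min-heap keyed by similarity
    for idx, emb in enumerate(frames):
        item = (cosine(emb, query), idx, emb)
        if len(heap) < k:
            heapq.heappush(heap, item)
        else:
            heapq.heappushpop(heap, item)  # drop the weakest of the k + 1
    # Return the survivors in temporal (frame-index) order.
    return [(idx, emb) for _, idx, emb in sorted(heap, key=lambda t: t[1])]


def pick_option(kept: List[Tuple[int, Vector]], aux: Dict[str, Vector],
                options: List[List[str]]) -> int:
    """Pick the option whose object order best matches the kept frames."""
    def score(option: List[str]) -> float:
        return sum(cosine(emb, aux[name])
                   for (_, emb), name in zip(kept, option))
    return max(range(len(options)), key=lambda i: score(options[i]))


# Toy 6-d embeddings: [target, a, b, c, d, background] (hypothetical data).
query = [1, 0, 0, 0, 0, 0]
frames = [
    [0, 0, 0, 0, 0, 1], [1, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0], [0.1, 0, 0, 0, 0, 1], [1, 0, 0, 0, 1, 0],
    [1, 0, 0, 1, 0, 0],
]
aux = {"a": [0, 1, 0, 0, 0, 0], "b": [0, 0, 1, 0, 0, 0],
       "c": [0, 0, 0, 1, 0, 0], "d": [0, 0, 0, 0, 1, 0]}
options = [["a", "b", "c", "d"], ["b", "a", "d", "c"], ["d", "c", "b", "a"]]

kept = top_k_frames(iter(frames), query)
answer = pick_option(kept, aux, options)
```

Because the heap never holds more than `k` entries, memory use in this sketch is independent of video length, and the only temporal information that reaches the answer step is the relative order of the kept frames.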