Can Vision-Language Models Solve the Shell Game?

Tiedong Liu, Wee Sun Lee

TL;DR

Drawing connections to the state-tracking problem, the paper proves that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision, owing to expressivity constraints. It then proposes Spatiotemporal Grounded Chain-of-Thought (SGCoT), which generates object trajectories as explicit intermediate states.

Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .
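
To make the state-tracking view concrete, the following is a minimal sketch of the shell game as a sequence of position swaps, with the target's position written out after every step. It is an illustration only: the `Swap` encoding and the `track_target` helper are hypothetical names chosen here, not VET-Bench's data format or the SGCoT output format. The per-step trajectory loosely mirrors the idea of emitting explicit intermediate states rather than jumping straight to the final answer.

```python
from typing import List, Tuple

# Hypothetical encoding (an assumption, not the benchmark's format):
# each swap names the two positions whose objects are exchanged.
Swap = Tuple[int, int]


def track_target(start: int, swaps: List[Swap]) -> List[int]:
    """Return the target's position at the start and after each swap."""
    trajectory = [start]
    pos = start
    for a, b in swaps:
        # Only a swap that touches the target's current position moves it.
        if pos == a:
            pos = b
        elif pos == b:
            pos = a
        trajectory.append(pos)
    return trajectory


# Three objects and five swaps, matching the setting reported for Figure 2.
swaps = [(0, 1), (1, 2), (0, 2), (0, 1), (1, 2)]
trajectory = track_target(0, swaps)
print(trajectory)      # intermediate states, one per swap
print(trajectory[-1])  # final answer
```

With three visually identical objects, a guess that ignores the swaps is right one time in three, which is the chance level against which the reported near-chance results should be read.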

Paper Structure

This paper contains 37 sections, 4 theorems, 15 equations, 17 figures, and 2 tables.

Key Result

Theorem 1

For any fixed $k \ge 5$, $\mathrm{TRACK}_k$ is $\mathbf{NC}^1$-complete.
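
The threshold $k \ge 5$ coincides with the point at which the symmetric group $S_k$ stops being solvable, and the word problem for $S_5$ is classically known to be $\mathbf{NC}^1$-complete (Barrington's theorem). As a rough illustration of that word problem, and not of the paper's reduction, the sketch below composes a sequence of permutations of five elements and checks whether the product is the identity. The tuple representation and the helper names (`compose`, `transposition`, `word_is_identity`) are assumptions made for this example.

```python
from functools import reduce
from typing import Sequence, Tuple

# A permutation of {0, ..., 4} stored as a tuple: perm[i] is the image of i.
Perm = Tuple[int, ...]

IDENTITY: Perm = (0, 1, 2, 3, 4)


def compose(first: Perm, second: Perm) -> Perm:
    """Return the permutation that applies `first`, then `second`."""
    return tuple(second[first[i]] for i in range(len(first)))


def transposition(a: int, b: int, k: int = 5) -> Perm:
    """Return the permutation of k elements that exchanges a and b."""
    images = list(range(k))
    images[a], images[b] = images[b], images[a]
    return tuple(images)


def word_is_identity(word: Sequence[Perm]) -> bool:
    """Decide the word problem: does the product of `word` equal the identity?"""
    return reduce(compose, word, IDENTITY) == IDENTITY


# Swapping the same pair twice cancels out.
print(word_is_identity([transposition(0, 1), transposition(0, 1)]))
# Two different swaps form a 3-cycle, which is not the identity...
print(word_is_identity([transposition(0, 1), transposition(1, 2)]))
# ...but repeating that 3-cycle three times returns to the identity.
print(word_is_identity([transposition(0, 1), transposition(1, 2)] * 3))
```

The left-to-right fold is sequential: each step depends on the running product so far. Emitting that running state step by step is cheap, whereas producing only the final answer in a fixed number of layers is where the expressivity argument applies.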

Figures (17)

  • Figure 1: Overview of VET-Bench.
  • Figure 2: Performance on VET-Bench, consisting of 50 cups-game and 50 cards-game videos featuring 3 objects and 5 swaps ($\sim$12 seconds). Existing VLMs all perform near random chance. Molmo2-SGCoT is a fine-tuned model based on Molmo2 that leverages Spatiotemporal Grounded Chain-of-Thought (SGCoT) to solve the shell game (see the SGCoT section).
  • Figure 3: Performance of VLMs under different swap and object counts.
  • Figure 4: Training and validation loss for direct-answer training on 500 synthetic VET-Bench cups-game videos.
  • Figure 5: Example frames from videos involving distinct cups in the Perception Test.
  • ...and 12 more figures

Theorems & Definitions (10)

  • Definition 1: Visual Entity Tracking, $\mathrm{TRACK}_k$
  • Definition 2: Word Problem for $S_5$, $\mathrm{WORD}_{S_5}$
  • Theorem 1
  • proof: Proof Sketch
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • Theorem 1
  • proof