Can Vision-Language Models Solve the Shell Game?

Tiedong Liu; Wee Sun Lee

TL;DR

A theoretical analysis connects visual entity tracking to the state-tracking problem and proves that, due to expressivity constraints, fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision. To address this, the paper proposes Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states.

Abstract

Visual entity tracking is an innate cognitive ability in humans, yet it remains a critical bottleneck for Vision-Language Models (VLMs). This deficit is often obscured in existing video benchmarks by visual shortcuts. We introduce VET-Bench, a synthetic diagnostic testbed featuring visually identical objects that necessitate tracking exclusively through spatiotemporal continuity. Our experiments reveal that current state-of-the-art VLMs perform at or near chance level on VET-Bench, exposing a fundamental limitation: an over-reliance on static frame-level features and a failure to maintain entity representations over time. We provide a theoretical analysis drawing connections to the state-tracking problem, proving that fixed-depth transformer-based VLMs are fundamentally limited in tracking indistinguishable objects without intermediate supervision due to expressivity constraints. To address this, we propose Spatiotemporal Grounded Chain-of-Thought (SGCoT): generating object trajectories as explicit intermediate states. Leveraging Molmo2's object tracking ability, we elicit SGCoT reasoning by fine-tuning on synthesized text-only data for alignment. Our method achieves state-of-the-art accuracy exceeding 90% on VET-Bench, demonstrating that VLMs can reliably solve the video shell-game task end-to-end without external tools. Our code and data are available at https://vetbench.github.io .
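To make the idea of "object trajectories as explicit intermediate states" concrete, the following is a minimal sketch of the symbolic core of a shell game: the target's position is updated after every swap and each intermediate position is written out before the final answer. The function names, the zero-based position encoding, and the textual trace format are assumptions made for illustration; they are not the SGCoT output format or the VET-Bench data generator.

```python
from typing import List, Tuple


def track_with_trajectory(
    num_objects: int, start: int, swaps: List[Tuple[int, int]]
) -> List[int]:
    """Return the target's position after each swap (hypothetical helper).

    Positions are zero-based slot indices. A swap (a, b) exchanges the
    objects currently sitting in slots a and b.
    """
    if not 0 <= start < num_objects:
        raise ValueError("start position out of range")
    position = start
    trajectory = [position]  # step 0 is the initial position
    for a, b in swaps:
        if not (0 <= a < num_objects and 0 <= b < num_objects):
            raise ValueError("swap index out of range")
        if position == a:
            position = b
        elif position == b:
            position = a
        trajectory.append(position)  # one explicit intermediate state per swap
    return trajectory


def render_trace(trajectory: List[int]) -> str:
    """Format the trajectory as a step-by-step text trace (assumed format)."""
    lines = [f"step {i}: target at position {p}" for i, p in enumerate(trajectory)]
    lines.append(f"answer: {trajectory[-1]}")
    return "\n".join(lines)


# Three objects and five swaps, matching the size of the benchmark setting.
example_swaps = [(0, 1), (1, 2), (0, 2), (0, 1), (1, 2)]
example_trajectory = track_with_trajectory(3, 0, example_swaps)
print(render_trace(example_trajectory))
```

Each update depends only on the previous position and the current swap, so writing the position out at every step turns one long tracking problem into a chain of constant-size steps.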

Paper Structure (37 sections, 4 theorems, 15 equations, 17 figures, 2 tables)

Introduction
Contributions.
Data Generation
Task Formulation
Task Suite
Experiment
Experimental Setup
Models.
Metrics.
Settings.
Results
Direct Answer
Coarse Description
Inaccurate Perception and Hallucination
Swap Count
...and 22 more sections

Key Result

Theorem 1

For any fixed $k \ge 5$, $\mathrm{TRACK}_k$ is $\mathbf{NC}^1$-complete.
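The threshold $k \ge 5$ lines up with the word problem for $S_5$ listed among the paper's definitions: a sequence of swaps composes into a single permutation, and transpositions on five positions generate all of $S_5$. The sketch below illustrates both facts under assumed conventions (zero-based positions, permutations stored as tuples mapping an initial slot to its current slot). It is an illustration of the connection, not the paper's proof or reduction.

```python
from typing import List, Set, Tuple

Perm = Tuple[int, ...]


def compose(p: Perm, q: Perm) -> Perm:
    """Return the permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


def swap_perm(k: int, a: int, b: int) -> Perm:
    """Permutation on k positions that exchanges positions a and b."""
    perm = list(range(k))
    perm[a], perm[b] = perm[b], perm[a]
    return tuple(perm)


def final_position(k: int, start: int, swaps: List[Tuple[int, int]]) -> int:
    """Track one object by evaluating the product of all swaps."""
    total: Perm = tuple(range(k))  # identity: nothing has moved yet
    for a, b in swaps:
        # total maps an initial slot to its current slot; apply the new swap on top.
        total = compose(swap_perm(k, a, b), total)
    return total[start]


def generated_group(k: int) -> Set[Perm]:
    """Enumerate every permutation reachable by composing adjacent swaps."""
    generators = [swap_perm(k, i, i + 1) for i in range(k - 1)]
    identity: Perm = tuple(range(k))
    seen = {identity}
    frontier = [identity]
    while frontier:
        g = frontier.pop()
        for s in generators:
            h = compose(s, g)
            if h not in seen:
                seen.add(h)
                frontier.append(h)
    return seen


# The object starting in slot 0 is carried along by four successive swaps.
end_slot = final_position(5, 0, [(0, 1), (1, 2), (2, 3), (3, 4)])
# Swaps on five positions reach all 5! = 120 permutations, i.e. all of S_5.
group_size = len(generated_group(5))
print(end_slot, group_size)
```

Because every element of $S_5$ can be written as a product of swaps, answering "where is the object now?" for arbitrary swap sequences amounts to evaluating a product in $S_5$, which is the kind of state-tracking computation the theorem concerns.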

Figures (17)

Figure 1: Overview of VET-Bench.
Figure 2: Performance on VET-Bench, consisting of 50 cups-game and 50 cards-game videos featuring 3 objects and 5 swaps ($\sim$12 seconds). Existing VLMs all perform near random chance. Molmo2-SGCoT is a fine-tuned model based on Molmo2 that leverages Spatiotemporal Grounded Chain-of-Thought (SGCoT) to solve the shell game (Section \ref{sec:sgcot}).
Figure 3: Performance of VLMs under different swap and object counts.
Figure 4: Training and validation loss for direct-answer training on 500 synthetic VET-Bench cups-game videos.
Figure 5: Example frames from videos involving distinct cups in the Perception Test.
...and 12 more figures

Theorems & Definitions (10)

Definition 1: Visual Entity Tracking, $\mathrm{TRACK}_k$
Definition 2: Word Problem for $S_5$, $\mathrm{WORD}_{S_5}$
Theorem 1
proof : Proof Sketch
Lemma 1
proof
Lemma 2
proof
Theorem 1
proof
