From Perception to Action: An Interactive Benchmark for Vision Reasoning

Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee

TL;DR

This work conducts a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. It shows that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions.

Abstract

Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess an agent's ability to reason about how geometry, contact, and support relations jointly constrain which actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints: they often fail to produce reliable long-horizon plans and cannot robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.

Paper Structure (45 sections, 7 equations, 9 figures, 5 tables, 2 algorithms)

This paper contains 45 sections, 7 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Static vs. Interactive Evaluation of Physical Structure Reasoning. (a) Traditional VQA relies on passive observation of an image. (b) Our paradigm requires multi-step interaction, enabling procedural evaluation of planning and structural understanding.
  • Figure 2: Benchmark construction pipeline. We illustrate the end-to-end process for building our benchmark, including problem sourcing, document collection and filtering, concept annotation, regime construction, and final evaluation setup. The pipeline is designed to ensure controlled difficulty, minimize parametric leakage, and enable fine-grained analysis of reasoning and retrieval behaviors.
  • Figure 3: Comparison of cost and token efficiency against solved tasks across models.
  • Figure 4: Qualitative results on the Luban puzzle subtask across world models. Top: Level 1 (two beams). Bottom: Level 2 (six beams). All models fail to produce a physically valid disassembly, either violating interlocking constraints or hallucinating (e.g., structural corruption and object insertion/removal), with failures worsening at higher complexity.
  • Figure 5: Examples of puzzle tasks at different levels.
  • ...and 4 more figures