PhyBlock: A Progressive Benchmark for Physical Understanding and Planning via 3D Block Assembly

Liang Ma, Jiajun Wen, Min Lin, Rongtao Xu, Xiwen Liang, Bingqian Lin, Jun Ma, Yongxin Wang, Ziming Wei, Haokun Lin, Mingfei Han, Meng Cao, Bokui Chen, Ivan Laptev, Xiaodan Liang

TL;DR

PhyBlock addresses the gap in evaluating physically grounded, long-horizon planning in Vision-Language Models by introducing two aligned benchmarks: Hierarchical Assembly Planning and Physics Understanding VQA, built on a physics-enabled 3D block world. It provides 400 scenes across Level-1 to Level-4 and 2,200 VQA items, with AOV-based diagnostics to separate partial completion, failure diagnosis, and planning robustness. The study benchmarks 21 VLMs, revealing strong perceptual abilities but clear deficits in high-level physical reasoning and sequential planning, with orientation and dependency errors dominating and limited benefits from chain-of-thought prompts. The results motivate integrating explicit physical priors and interactive feedback into multimodal models for robust embodied intelligence.

Abstract

While vision-language models (VLMs) have demonstrated promising capabilities in reasoning and planning for embodied agents, their ability to comprehend physical phenomena, particularly within structured 3D environments, remains severely limited. To close this gap, we introduce PhyBlock, a progressive benchmark designed to assess VLMs on physical understanding and planning through robotic 3D block assembly tasks. PhyBlock integrates a novel four-level cognitive hierarchy assembly task alongside targeted Visual Question Answering (VQA) samples, collectively aimed at evaluating progressive spatial reasoning and fundamental physical comprehension, including object properties, spatial relationships, and holistic scene understanding. PhyBlock includes 2,600 block tasks (400 assembly tasks, 2,200 VQA tasks) and evaluates models across three key dimensions: partial completion, failure diagnosis, and planning robustness. We benchmark 21 state-of-the-art VLMs, highlighting their strengths and limitations in physically grounded, multi-step planning. Our empirical findings indicate that VLMs exhibit pronounced limitations in high-level planning and reasoning, with performance declining notably as task complexity grows. Error analysis reveals persistent difficulties in spatial orientation and dependency reasoning. Surprisingly, chain-of-thought prompting offers minimal improvements, suggesting that spatial tasks rely heavily on intuitive model comprehension. We position PhyBlock as a unified testbed to advance embodied reasoning, bridging vision-language understanding and real-world physical problem-solving.

Paper Structure

This paper contains 49 sections, 40 figures, 5 tables, 1 algorithm.

Figures (40)

  • Figure 1: Assembly Planning Task in PhyBlock. The figure shows the inference setting for two planning strategies (one-time full planning and step-by-step planning).
  • Figure 2: Physics Understanding VQA in PhyBlock. We construct a compact set of questions per 3D assembly scene, covering four key dimensions of physical and spatial reasoning to assess diverse aspects of the model’s understanding of 3D block assembly.
  • Figure 3: Comprehensive Comparison of Mainstream Models Across Evaluation Dimensions. (a) A comparison of six representative models under both the A and B evaluation settings across all four task difficulty levels. (b) The differences between the two evaluation settings. For a detailed explanation, see Appendix \ref{sub:B2}. (c) A focused analysis of GPT-o1 under the two evaluation settings. Performance improves significantly when the strict constraint on pose alignment is relaxed, highlighting the model's potential under less rigid spatial requirements.
  • Figure 3: Results (%) overview. Step-by-step interactive reasoning results with the environment.
  • Figure 4: Error Type Analysis of Assembly Steps. We systematically analyze four distinct types of errors encountered during the planning process for each sample. Detailed definitions and the categorization of the four error types can be found in Appendix \ref{sub:errortype}.
  • ...and 35 more figures