How Far Are Vision-Language Models from Constructing the Real World? A Benchmark for Physical Generative Reasoning

Luyu Yang, Yutong Dai, An Yan, Viraj Prabhu, Ran Xu, Zeyuan Chen

Abstract

The physical world is not merely visual; it is governed by rigorous structural and procedural constraints. Yet, the evaluation of vision-language models (VLMs) remains heavily skewed toward perceptual realism, prioritizing the generation of visually plausible 3D layouts, shapes, and appearances. Current benchmarks rarely test whether models grasp the step-by-step processes and physical dependencies required to actually build these artifacts, a capability essential for automating design-to-construction pipelines. To address this, we introduce DreamHouse, a novel benchmark for physical generative reasoning: the capacity to synthesize artifacts that concurrently satisfy geometric, structural, constructability, and code-compliance constraints. We ground this benchmark in residential timber-frame construction, a domain with fully codified engineering standards and objectively verifiable correctness. We curate over 26,000 structures spanning 13 architectural styles, each verified to construction-document standards (LOD 350), and develop a deterministic 10-test structural validation framework. Unlike static benchmarks that assess only final outputs, DreamHouse supports iterative agentic interaction. Models observe intermediate build states, generate construction actions, and receive structured environmental feedback, enabling a fine-grained evaluation of planning, structural reasoning, and self-correction. Extensive experiments with state-of-the-art VLMs reveal substantial capability gaps that are largely invisible on existing leaderboards. These findings establish physical validity as a critical evaluation axis orthogonal to visual realism, highlighting physical generative reasoning as a distinct and underdeveloped frontier in multimodal intelligence. Available at https://luluyuyuyang.github.io/dreamhouse

Paper Structure

This paper contains 95 sections, 27 equations, 14 figures, and 6 tables.

Figures (14)

  • Figure 1: DreamHouse benchmark samples. Each row shows a single structure across five representations: material schedule (member counts by subsystem), foundation, two intermediate framing stages, and complete timber frame, alongside the target exterior rendering from which the model must infer the hidden structural system. Three of the 13 architectural styles are shown: Split-Level (678 members), Cruciform (1,350 members), and Barn (845 members). Member counts span four subsystems: foundation (Fdn), floor, walls, and roof.
  • Figure 2: DreamHouse dataset overview. (Top) Dataset statistics showing member count distributions (box plots, left axis) and style proportions (dashed line, right axis) across all 13 architectural styles, ordered by decreasing structural complexity. Member counts range from 133 to 1,548 (mean 673). (Bottom) Representative Cycles-rendered timber frame structures for each style: Courtyard, Cruciform, Colonial, Split-Level, Barn, Ranch, Townhouse, Z-Plan, Carriage, Farmhouse, Saltbox, Shotgun, A-Frame, ranging from complex multi-wing configurations to compact single-story forms.
  • Figure 3: Structural Validation Suite. 10 tests across four pillars. IRC Compliance (left): load path, span limits, completeness, stability score. Structure Physics (center-left): L/360 deflection, cantilever ratio, point load bearing. Geometric Integrity (center-right): dual-end restraint, gap detection, roof coverage. LoD 350 (right): fabrication-grade member geometry.
  • Figure 4: Task formalization example, Planner-Managed. Top (agent loop): Evaluation is formalized as a recurrent agentic process. At each turn $t$, the agent $\mathcal{A}$ (VLM) receives observation $o_t = (I_0,\, f_{t-1})$, comprising the original multi-view task image $I_0$ and the structured validation feedback $f_{t-1}$ from the previous turn, and generates a Blender Python action $a_t$. The executor $\mathcal{E}$ applies $a_t$ to transition the scene graph $s_{t-1} \to s_t$, and the validator $\mathcal{V}$ produces feedback $f_t$ reporting per-test pass/fail results and violation counts. This feedback becomes the input observation for the next turn, closing the loop. On failure (e.g., $f_2$, red), the agent retries from the same scene state $s_2$ without resetting context. $\mathcal{A}$, $\mathcal{E}$, and $\mathcal{V}$ are omitted from the diagram for visual clarity; see the task formalization section for the full treatment, and the sketch after this list for a minimal rendering of the loop. Bottom (task instantiation): The task input $I_0$ consists of five rendered views of the target structure paired with building context and rules. The agent reasons over these to produce and iteratively revise construction code; the environment executes the code in Blender and the validator returns structured diagnostic feedback driving the next revision.
  • Figure 5: Planner-Managed qualitative example AF-01-0060 (A-frame style). All three models begin from scratch and receive stepwise visual feedback toward the same target structure. Although both reach a valid result at step 6, Gemini and Claude employ markedly different construction strategies. Gemini pursues a top-down, shape-first approach: it approximates the overall silhouette early and refines toward it. Claude reasons bottom-up: it first establishes a structurally sound interior frame, then lays the roof rafters over it in the final step -- a sequence closer to how a builder would physically construct an A-frame. GPT-5 fails to recognize the defining constraint of A-frame geometry, that the roof planes double as load-bearing walls, and from step 6 onward enters a futile loop of adding conventional wall studs along the perimeter. Unable to escape this structural misconception, it exhausts all attempts without producing a valid result. This example highlights that identical visual feedback can elicit fundamentally different reasoning strategies, and that success depends not just on visual-matching ability but on implicit architectural knowledge.
  • ...and 9 more figures
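
To make the turn-by-turn protocol of Figures 3 and 4 concrete, below is a minimal Python sketch of one evaluation episode. All names here (Feedback, run_episode, and the agent/executor/validator interfaces) are hypothetical illustrations chosen for exposition, not the benchmark's actual API; the sketch assumes only what the captions state, namely that the validator reports per-test pass/fail flags and violation counts, and that a failed turn retries from the same scene state without resetting context.

    from dataclasses import dataclass, field

    @dataclass
    class Feedback:
        # f_t: per-test pass/fail flags and violation counts (Figures 3 and 4).
        passed: dict[str, bool] = field(default_factory=dict)
        violations: dict[str, int] = field(default_factory=dict)

        @property
        def all_pass(self) -> bool:
            # True only when at least one test has run and every test passes.
            return bool(self.passed) and all(self.passed.values())

    def run_episode(agent, executor, validator, task_images, max_turns=10):
        # One episode of the loop in Figure 4. The agent, executor, and
        # validator are duck-typed stand-ins for A, E, and V; their method
        # names are hypothetical.
        feedback = Feedback()        # f_0: empty feedback before the first turn
        state = executor.reset()     # s_0: empty scene graph
        for _ in range(max_turns):
            # o_t = (I_0, f_{t-1}): target views plus the previous feedback.
            action = agent.act(task_images, feedback)  # a_t: Blender Python code
            state = executor.apply(state, action)      # s_{t-1} -> s_t
            feedback = validator.check(state)          # f_t: structured report
            if feedback.all_pass:
                return True
            # On failure, loop again from the same scene state without
            # resetting context (cf. the f_2 retry in Figure 4).
        return False

In this sketch the entire 10-test suite of Figure 3 lives inside validator.check, so structural information reaches the agent only through the structured Feedback object, never through privileged access to the scene graph.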

Theorems & Definitions (5)

  • Definition 1: Contact Relation
  • Definition 2: Ground Set
  • Definition 3: Support Function
  • Definition 4: Compliant Spacing
  • Definition 5: Zone Connection
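
The five definitions above are listed by name only in this overview; their bodies appear in the paper. Purely as a hedged illustration of how such notions commonly compose in rigid-frame grounding checks, and not as the paper's actual formalism, the following LaTeX sketch treats a support function as reachability through a contact relation down to a ground set:

    % Hypothetical illustration only -- not the paper's definitions.
    % Requires amsmath. M: set of members; C \subseteq M \times M: contact
    % relation, with (m, m') \in C read as "m bears on m'"; G \subseteq M:
    % ground set, members bearing directly on the foundation.
    \[
      \mathrm{supp}(m) \;=\;
      \begin{cases}
        1 & \text{if } m \in G,\\[2pt]
        \max\limits_{(m,\,m') \in C} \mathrm{supp}(m') & \text{otherwise,}
      \end{cases}
    \]
    % with the max over an empty set taken as 0. Under this reading, a frame
    % passes a load-path check iff \mathrm{supp}(m) = 1 for every m \in M.

Notions like Compliant Spacing and Zone Connection would then plausibly layer code-driven constraints (for example, member spacing limits) on top of this purely topological grounding condition.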