LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models

Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu

TL;DR

LayoutVLM tackles open-universe 3D layout generation by marrying Vision-Language Model grounding with differentiable optimization over a dual-representation scene layout: numerical poses $\hat{p}_i=(x_i,y_i,z_i,\theta_i)$ and spatial relations $\mathcal{R}$. A self-consistent decoding mechanism and visual prompting improve spatial grounding, while a differentiable objective $\mathcal{L}_{\text{semantic}} + \mathcal{L}_{\text{physics}}$ yields physically plausible, semantically aligned layouts. The approach shows strong gains over baselines across 11 room types and benefits from fine-tuning VLMs on scene data, with ablations confirming the necessity of visual cues and self-consistency. The results highlight the potential of VLM-guided differentiable optimization for scalable, open-vocabulary 3D scene generation with practical impact for robotics, simulation, and design.

Abstract

Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still struggle with 3D reasoning tasks like arranging objects in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.
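To make the idea of optimizing poses against a combined semantic and physics objective concrete, here is a minimal toy sketch. It is not the paper's implementation. The object names, the disc-shaped footprints, the room size, the single "chair is 1.2 m from desk" relation, the loss forms, and the finite-difference gradients are all assumptions chosen for illustration. In the actual system, a VLM would supply both the initial poses and the relations.

```python
import math

# Toy assumptions (not from the paper): square room, disc-shaped objects,
# poses stored as (x, y, theta) with z omitted for brevity.
ROOM = 4.0     # room side length in metres
RADIUS = 0.5   # footprint radius of every object in metres

def semantic_loss(poses):
    """Hypothetical relation: the chair should sit 1.2 m from the desk."""
    (x0, y0, _), (x1, y1, _) = poses["desk"], poses["chair"]
    return (math.hypot(x1 - x0, y1 - y0) - 1.2) ** 2

def physics_loss(poses):
    """Penalize objects that leave the room or overlap each other."""
    loss = 0.0
    names = sorted(poses)
    for i, a in enumerate(names):
        xa, ya, _ = poses[a]
        for coord in (xa, ya):
            loss += max(0.0, RADIUS - coord) ** 2
            loss += max(0.0, coord - (ROOM - RADIUS)) ** 2
        for b in names[i + 1:]:
            xb, yb, _ = poses[b]
            loss += max(0.0, 2 * RADIUS - math.hypot(xa - xb, ya - yb)) ** 2
    return loss

def total_loss(poses):
    # Mirrors the structure L_semantic + L_physics with equal weights (assumed).
    return semantic_loss(poses) + physics_loss(poses)

def optimize(poses, steps=500, lr=0.05, eps=1e-5):
    """Plain gradient descent using central finite-difference gradients."""
    poses = {k: list(v) for k, v in poses.items()}
    for _ in range(steps):
        grads = {}
        for name, p in poses.items():
            g = []
            for j in range(len(p)):
                old = p[j]
                p[j] = old + eps
                up = total_loss(poses)
                p[j] = old - eps
                down = total_loss(poses)
                p[j] = old
                g.append((up - down) / (2 * eps))
            grads[name] = g
        for name, g in grads.items():
            for j, gj in enumerate(g):
                poses[name][j] -= lr * gj
    return {k: tuple(v) for k, v in poses.items()}

# A deliberately poor initial guess: overlapping objects, partly outside the room.
initial = {"desk": (0.2, 0.3, 0.0), "chair": (0.4, 0.3, 0.0)}
final = optimize(initial)
print(total_loss(initial), total_loss(final))
```

Finite differences stand in here for automatic differentiation only to keep the sketch dependency-free; a practical version would likely use an autodiff framework and richer geometry such as oriented bounding boxes.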

Paper Structure

This paper contains 33 sections, 8 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: From unlabeled 3D assets and language instruction, LayoutVLM generates scene layouts that are physically plausible and semantically coherent—two criteria that existing methods often struggle to meet. Our approach addresses this by using a VLM to generate a scene layout representation that defines both an initial layout and spatial relations between assets for differentiable optimization.
  • Figure 2: Example Scene Representation. Example of our scene representation for a bedroom. Our scene representation consists of numerical estimates of object poses and spatial relations corresponding to objective functions on these poses. Having the VLMs generate the initial estimates allows us to exploit the semantic knowledge in the large models, and having spatial relations amenable to optimization allows us to generate physically precise placements.
  • Figure 3: LayoutVLM. We illustrate the proposed process of generating a 3D scene layout with Vision-Language Models.
  • Figure 4: Qualitative Comparison. We compare with baseline methods in generating layouts based on detailed language instructions. Our method is able to generate layouts that closely follow the instructions and adhere to physical constraints.
  • Figure 5: Examples of Following Detailed Instructions. We show the same set of assets arranged with different language instructions. The latter two examples show that LayoutVLM can closely follow the prompts even when the desired layouts are unconventional.
  • ...and 2 more figures