LayoutVLM: Differentiable Optimization of 3D Layout via Vision-Language Models
Fan-Yun Sun, Weiyu Liu, Siyi Gu, Dylan Lim, Goutam Bhat, Federico Tombari, Manling Li, Nick Haber, Jiajun Wu
TL;DR
LayoutVLM tackles open-universe 3D layout generation by marrying Vision-Language Model grounding with differentiable optimization over a dual-representation scene layout: numerical poses $\hat{p}_i=(x_i,y_i,z_i,\theta_i)$ and spatial relations $\mathcal{R}$. A self-consistent decoding mechanism and visual prompting improve spatial grounding, while a differentiable objective $\mathcal{L}_{\text{semantic}} + \mathcal{L}_{\text{physics}}$ yields physically plausible, semantically aligned layouts. The approach shows strong gains over baselines across 11 room types and benefits from fine-tuning VLMs on scene data, with ablations confirming the necessity of visual cues and self-consistency. The results highlight the potential of VLM-guided differentiable optimization for scalable, open-vocabulary 3D scene generation with practical impact for robotics, simulation, and design.
Abstract
Spatial reasoning is a fundamental aspect of human cognition, enabling intuitive understanding and manipulation of objects in three-dimensional space. While foundation models demonstrate remarkable performance on some benchmarks, they still struggle with 3D reasoning tasks like arranging objects in space according to open-ended language instructions, particularly in dense and physically constrained environments. We introduce LayoutVLM, a framework and scene layout representation that exploits the semantic knowledge of Vision-Language Models (VLMs) and supports differentiable optimization to ensure physical plausibility. LayoutVLM employs VLMs to generate two mutually reinforcing representations from visually marked images, and uses a self-consistent decoding process to improve VLMs' spatial planning. Our experiments show that LayoutVLM addresses the limitations of existing LLM and constraint-based approaches, producing physically plausible 3D layouts better aligned with the semantic intent of input language instructions. We also demonstrate that fine-tuning VLMs with the proposed scene layout representation extracted from existing scene datasets can improve their reasoning performance.
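
To make the objective $\mathcal{L}_{\text{semantic}} + \mathcal{L}_{\text{physics}}$ from the TL;DR concrete, the following is a minimal sketch under strong simplifying assumptions, not the authors' implementation. Objects are reduced to circles on the floor plane with poses $(x_i, y_i)$ ($z_i$ and $\theta_i$ are dropped), the only spatial relation is a hypothetical "objects 0 and 1 should be a target distance apart", physics covers only overlap and room bounds, and central finite differences stand in for automatic differentiation. All function names and parameters are illustrative.

```python
import math

def semantic_loss(p, target=1.2):
    # Hypothetical relation: objects 0 and 1 should sit `target` metres apart.
    d = math.hypot(p[0] - p[2], p[1] - p[3])
    return (d - target) ** 2

def physics_loss(p, radii, room=4.0):
    # Assumed physics terms: stay inside a square room [0, room]^2
    # and do not overlap other objects (each object is a circle).
    n = len(radii)
    loss = 0.0
    for i in range(n):
        xi, yi, ri = p[2 * i], p[2 * i + 1], radii[i]
        for c in (xi, yi):
            loss += max(0.0, ri - c) ** 2 + max(0.0, c + ri - room) ** 2
        for j in range(i + 1, n):
            d = math.hypot(xi - p[2 * j], yi - p[2 * j + 1])
            loss += max(0.0, ri + radii[j] - d) ** 2
    return loss

def total_loss(p, radii):
    # Unweighted sum of the two terms, as written in the TL;DR.
    return semantic_loss(p) + physics_loss(p, radii)

def optimize(p, radii, lr=0.05, steps=500, eps=1e-5):
    # Plain gradient descent on the flat pose vector [x0, y0, x1, y1, ...].
    # Gradients are approximated with central finite differences.
    p = list(p)
    for _ in range(steps):
        grad = []
        for k in range(len(p)):
            hi = p[:k] + [p[k] + eps] + p[k + 1:]
            lo = p[:k] + [p[k] - eps] + p[k + 1:]
            grad.append((total_loss(hi, radii) - total_loss(lo, radii)) / (2 * eps))
        p = [v - lr * g for v, g in zip(p, grad)]
    return p
```

Calling `optimize([1.0, 1.0, 1.2, 1.0, 3.9, 2.0], [0.5, 0.5, 0.5])` starts from two overlapping objects and a third that pokes through a wall; the descent pushes the first two apart toward the target distance and pulls the third back inside the room. In this reading, the VLM's role would be to supply the initial poses and the relations that define the semantic term.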
