Text-to-Scene with Large Reasoning Models

Frédéric Berdoz, Luca A. Lanzendörfer, Nick Tuninga, Roger Wattenhofer

TL;DR

Reason-3D tackles text-to-3D scene synthesis under complex spatial instructions by leveraging Large Reasoning Models to perform object retrieval and placement in a zero-shot setting. The pipeline uses caption-based object descriptions to semantically retrieve assets from Objaverse, then employs autoregressive layout followed by collision-aware refinement to ensure spatial coherence without handcrafted rules. Key contributions include a dual-stage placement strategy, a size-after-rotation-aware refinement, and demonstrated generalization to outdoor scenes, along with a public code release. The work shows that modern LRMs can handle geometry, context, and physical constraints at scale, enabling more faithful and flexible scene generation for indoor and outdoor environments. Practically, Reason-3D offers a robust, language-driven path to open-ended 3D scene creation without task-specific training or restrictions on object libraries.
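The "size-after-rotation-aware" refinement mentioned above depends on a simple geometric fact: when an object is rotated about the vertical axis, the axis-aligned footprint it occupies changes. The following is a minimal sketch of that calculation, not the authors' implementation; the function name `rotated_footprint` and the rectangular-footprint assumption are illustrative only.

```python
import math


def rotated_footprint(width: float, depth: float, yaw_deg: float) -> tuple[float, float]:
    """Axis-aligned (x, y) extent of a width x depth rectangle after
    rotating it by yaw_deg degrees about its own center.

    Hypothetical helper: assumes the object's footprint is a rectangle.
    """
    theta = math.radians(yaw_deg)
    c, s = abs(math.cos(theta)), abs(math.sin(theta))
    # Each side contributes its projection onto the x and y axes.
    return (width * c + depth * s, width * s + depth * c)


# A 2 x 1 sofa rotated by 90 degrees occupies (approximately) 1 x 2.
print(rotated_footprint(2.0, 1.0, 90.0))
```

Using the rotated extent rather than the original width and depth is what lets a collision check stay meaningful after an object has been turned.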

Abstract

Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to show weak adherence to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D integrates object retrieval using captions covering physical, functional, and contextual attributes. Reason-3D then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to further the research in object retrieval and placement with LRMs.
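To make the idea of collision-aware refinement concrete, the sketch below detects an overlap between two axis-aligned boxes on the floor plane and resolves it with a naive push along the axis of least penetration. This is an assumption-laden illustration: in Reason-3D the adjustment is made through the LRM's spatial reasoning, and the `Box` type and `push_apart` rule here are hypothetical stand-ins that only show what "calculating a collision" can look like.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float  # center x
    y: float  # center y
    w: float  # extent along x
    d: float  # extent along y


def penetration(a: Box, b: Box) -> tuple[float, float]:
    """Overlap depth along x and y; both are positive only if the boxes intersect."""
    px = (a.w + b.w) / 2 - abs(a.x - b.x)
    py = (a.d + b.d) / 2 - abs(a.y - b.y)
    return px, py


def collides(a: Box, b: Box) -> bool:
    px, py = penetration(a, b)
    return px > 0 and py > 0


def push_apart(fixed: Box, movable: Box) -> Box:
    """Move `movable` just far enough along one axis to stop overlapping `fixed`."""
    px, py = penetration(fixed, movable)
    if px <= 0 or py <= 0:
        return movable  # nothing to resolve
    if px <= py:
        sign = 1.0 if movable.x >= fixed.x else -1.0
        return Box(movable.x + sign * px, movable.y, movable.w, movable.d)
    sign = 1.0 if movable.y >= fixed.y else -1.0
    return Box(movable.x, movable.y + sign * py, movable.w, movable.d)


sofa = Box(0.0, 0.0, 2.0, 1.0)
table = Box(1.25, 0.0, 1.0, 1.0)
print(collides(sofa, table))                    # overlap before refinement
print(collides(sofa, push_apart(sofa, table)))  # resolved after the push
```

A purely geometric push like this ignores the instruction's constraints (for example, "in front of the sofa"), which is one reason to leave the actual adjustment to a reasoning model.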

Paper Structure

This paper contains 29 sections, 2 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Showcase comparison for object retrieval and placement between Reason-3D and baseline approaches for the instruction "A cozy living room of size 5 by 5 units. There is a plant on a small table in front of the L-shaped sofa."
  • Figure 2: Overview of our proposed architecture. Retrieval: We start by processing objects from an asset library, generating images of these objects for captioning and orientation. A captioning model creates object descriptions, which are then turned into embedding vectors and stored in a vector database. Given an instruction, we extract a list of objects that would be feasible according to the instruction, and subsequently query the database for such objects. Placement: Given the instruction and retrieved objects, we extract a set of constraints to determine an ordered sequence for object placement. Once all objects are placed, the scene is refined by calculating and adjusting for object collisions.
  • Figure 3: Qualitative comparison for object retrieval and placement across various scenes. We find that overall, Reason-3D can better follow instructions and place objects reasonably. Unlike Holodeck and Reason-3D, LayoutVLM was not designed to retrieve objects, so we use the objects retrieved by Reason-3D for LayoutVLM.
  • Figure 4: Qualitative comparison of object placement performance when instruction complexity is increased. Every scene is generated from scratch with the entire instruction given up to and including its circled number.
  • Figure 5: We benchmark various LRMs on three scenes. We find that Gemini 2.5 Pro achieves the best overall performance on spatial reasoning tasks. The bridge-building task also showcases object composition (building a bridge out of individual stone objects).
  • ...and 6 more figures
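
The retrieval stage described in the Figure 2 caption (captions turned into embedding vectors, stored, then queried) can be sketched as a nearest-neighbor lookup by cosine similarity. The snippet below is a toy illustration under stated assumptions: the three-dimensional vectors, the asset names, and the in-memory dictionary standing in for a vector database are all made up, and a real system would obtain embeddings from a text-embedding model.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical stand-in for a vector database: asset id -> caption embedding.
ASSET_EMBEDDINGS = {
    "sofa_l_shaped": [0.9, 0.1, 0.0],
    "potted_plant": [0.0, 0.2, 0.9],
    "side_table": [0.3, 0.9, 0.1],
}


def retrieve(query_embedding: list[float], k: int = 1) -> list[str]:
    """Return the ids of the k assets whose caption embeddings best match the query."""
    ranked = sorted(
        ASSET_EMBEDDINGS,
        key=lambda asset: cosine(query_embedding, ASSET_EMBEDDINGS[asset]),
        reverse=True,
    )
    return ranked[:k]


# A query vector that lies close to the plant's caption embedding.
print(retrieve([0.1, 0.1, 1.0]))
```

In the pipeline described above, one such query would be issued per object extracted from the instruction.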