Text-to-Scene with Large Reasoning Models

Frédéric Berdoz; Luca A. Lanzendörfer; Nick Tuninga; Roger Wattenhofer

TL;DR

Reason-3D tackles text-to-3D scene synthesis under complex spatial instructions by using Large Reasoning Models (LRMs) for zero-shot object retrieval and placement. The pipeline retrieves assets from Objaverse by semantically matching caption-based object descriptions, then places them with an autoregressive layout stage followed by collision-aware refinement, which keeps the scene spatially coherent without handcrafted rules. Key contributions include this two-stage placement strategy, a refinement step that accounts for each object's size after rotation, demonstrated generalization to outdoor scenes, and a public code release. The work shows that modern LRMs can reason about geometry, context, and physical constraints, enabling more faithful and flexible scene generation for indoor and outdoor environments. In practice, Reason-3D offers a language-driven path to open-ended 3D scene creation without task-specific training or restrictions on the object library.

Abstract

Prompt-driven scene synthesis allows users to generate complete 3D environments from textual descriptions. Current text-to-scene methods often struggle with complex geometries and object transformations, and tend to adhere only weakly to complex instructions. We address these limitations by introducing Reason-3D, a text-to-scene model powered by large reasoning models (LRMs). Reason-3D first retrieves objects using captions that cover physical, functional, and contextual attributes. It then places the selected objects based on implicit and explicit layout constraints, and refines their positions with collision-aware spatial reasoning. Evaluated on instructions ranging from simple to complex indoor configurations, Reason-3D significantly outperforms previous methods in human-rated visual fidelity, adherence to constraints, and asset retrieval quality. Beyond its contribution to the field of text-to-scene generation, our work showcases the advanced spatial reasoning abilities of modern LRMs. Additionally, we release the codebase to support further research on object retrieval and placement with LRMs.
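
To make the idea of collision-aware refinement with size after rotation concrete, the following is a minimal sketch and not the Reason-3D implementation, in which the refinement is carried out by the LRM's own spatial reasoning. It assumes each object is reduced to a rectangular floor footprint rotated about the vertical axis; the names `Footprint`, `rotated_extent`, `aabb`, and `collides` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Footprint:
    """Hypothetical floor footprint of an object (illustrative, not from the paper)."""
    x: float        # center x on the floor plane
    y: float        # center y on the floor plane
    width: float    # extent along the object's local x axis
    depth: float    # extent along the object's local y axis
    yaw_deg: float  # rotation about the vertical axis, in degrees


def rotated_extent(width: float, depth: float, yaw_deg: float) -> tuple[float, float]:
    """Axis-aligned size of a width x depth rectangle after rotating it by yaw_deg."""
    t = math.radians(yaw_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    return width * c + depth * s, width * s + depth * c


def aabb(fp: Footprint) -> tuple[float, float, float, float]:
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of the rotated footprint."""
    w, d = rotated_extent(fp.width, fp.depth, fp.yaw_deg)
    return fp.x - w / 2, fp.y - d / 2, fp.x + w / 2, fp.y + d / 2


def collides(a: Footprint, b: Footprint, eps: float = 1e-9) -> bool:
    """True if the two bounding boxes overlap by more than eps on both axes."""
    ax0, ay0, ax1, ay1 = aabb(a)
    bx0, by0, bx1, by1 = aabb(b)
    return (ax0 < bx1 - eps and bx0 < ax1 - eps
            and ay0 < by1 - eps and by0 < ay1 - eps)


sofa = Footprint(x=0.0, y=0.0, width=2.0, depth=1.0, yaw_deg=0.0)
table = Footprint(x=1.4, y=0.0, width=1.0, depth=1.0, yaw_deg=0.0)
print(collides(sofa, table))          # the unrotated sofa reaches into the table
sofa_turned = Footprint(x=0.0, y=0.0, width=2.0, depth=1.0, yaw_deg=90.0)
print(collides(sofa_turned, table))   # turned by 90 degrees, it no longer does
```

The point of the sketch is that the same object occupies a different axis-aligned area once it is rotated, so a collision check that uses the unrotated size can report the wrong result. The bounding box is a conservative approximation for rotations that are not multiples of 90 degrees.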
