AnyHome: Open-Vocabulary Generation of Structured and Textured 3D Homes

Rao Fu; Zehao Wen; Zichen Liu; Srinath Sridhar

TL;DR

AnyHome is a two-stage, text-controlled pipeline that translates open-vocabulary text into house-scale 3D indoor scenes with structured geometry and realistic textures. It combines LLM-generated modular descriptions, amodal hierarchical geometry, graph-based floorplan and room-layout generation, SDS-based refinement, and egocentric texture inpainting to support open-vocabulary generation and editing while maintaining structural coherence. The authors report improvements over baselines in layout quality and texture consistency, and the resulting interiors are diverse and editable, with applications in interior design, gaming, AR/VR, and embodied-agent training. By integrating language-driven planning, graph-based control, and view-consistent texture generation, the work targets scalable, richly textured, navigable indoor environments.

Abstract

Inspired by cognitive theories, we introduce AnyHome, a framework that translates any text into well-structured and textured indoor scenes at house scale. By prompting Large Language Models (LLMs) with designed templates, our approach converts provided textual narratives into amodal structured representations. These representations guarantee consistent and realistic spatial layouts by directing the synthesis of a geometry mesh within defined constraints. A Score Distillation Sampling process is then employed to refine the geometry, followed by an egocentric inpainting process that adds lifelike textures to it. AnyHome stands out with its editability, customizability, diversity, and realism. The structured representations for scenes allow for extensive editing at varying levels of granularity. Capable of interpreting texts ranging from simple labels to detailed narratives, AnyHome generates detailed geometries and textures that outperform existing methods in both quantitative and qualitative measures.
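To make the idea of a structured representation that can be edited at several levels of granularity more concrete, the following is a minimal sketch and not the authors' implementation. The class names (`House`, `Room`, `SceneObject`), their fields, and the room identifiers are all hypothetical and chosen only for illustration; the abstract does not specify the actual data format.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, List, Set


@dataclass
class SceneObject:
    # Hypothetical: an open-vocabulary label for one object.
    name: str


@dataclass
class Room:
    # Hypothetical: a room type plus the objects placed in it.
    room_type: str
    objects: List[SceneObject] = field(default_factory=list)


@dataclass
class House:
    # Hypothetical: a global style, rooms keyed by id, and an
    # undirected adjacency graph over room ids.
    style: str
    rooms: Dict[str, Room] = field(default_factory=dict)
    adjacency: Set[FrozenSet[str]] = field(default_factory=set)

    def connect(self, a: str, b: str) -> None:
        """Record that rooms a and b are adjacent."""
        if a == b:
            raise ValueError("a room cannot be adjacent to itself")
        if a not in self.rooms or b not in self.rooms:
            raise KeyError("both rooms must exist before connecting them")
        self.adjacency.add(frozenset((a, b)))

    def neighbors(self, room_id: str) -> List[str]:
        """Return the sorted ids of rooms adjacent to room_id."""
        return sorted(
            other
            for edge in self.adjacency
            if room_id in edge
            for other in edge
            if other != room_id
        )


# Build a small scene, then edit it at three levels of granularity.
house = House(style="Japanese tea house")
house.rooms["r0"] = Room("tea room", [SceneObject("low table")])
house.rooms["r1"] = Room("kitchen")
house.connect("r0", "r1")

house.style = "haunted house"                                      # house-level edit
house.rooms["r1"].room_type = "storage room"                       # room-level edit
house.rooms["r0"].objects[0] = SceneObject("cracked wooden table")  # object-level edit
```

In a sketch like this, each edit touches only one level of the hierarchy, so the adjacency graph and the untouched rooms stay unchanged. That is one plausible reading of why a structured intermediate representation makes localized edits straightforward.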

Paper Structure (31 sections, 1 equation, 23 figures, 3 tables, 1 algorithm)

Introduction
Related Work
AnyHome
Textual Input Modulation.
Hierarchical Structured Geometry Generation.
House Floorplan Generation.
Room Layout and Object Placement Generation.
Object Retrieval.
Egocentric Refinement and Inpainting.
Generating Egocentric Trajectories.
Refinement and Inpainting with Text-to-Image Models.
Results
Open-vocabulary Scene Generation.
Open-vocabulary Editing.
Diversity.
...and 16 more sections

Figures (23)

Figure 1: Example house-scale indoor scene generated by AnyHome. Users can input any textual description of an indoor scene, and the system is capable of generating house floorplans, room layouts, object placements, and stylistic appearances accordingly. The generated indoor scene is represented by structured and textured room and object meshes. AnyHome enables the synthesis of diverse indoor scenes, allowing users to control scene generation at any stage—from textual input and intermediate representation to the generated meshes.
Figure 2: Two-stage Generation Process. AnyHome unfolds in two primary steps. First, amodal representations are generated from the user's text input, which involves constructing modular text descriptions, constrained graphs, and a hierarchical structured base mesh. The method then enters an egocentric exploration stage, in which a navigation trajectory is generated so that the base mesh can be refined and textured from different viewpoints.
Figure 3: Pipeline. Taking a free-form textual input, our pipeline generates the house-scale scene by: (i) comprehending and elaborating on the user's textual input through querying an LLM with templated prompts; (ii) converting textual descriptions into base geometry using structured intermediate representations; (iii) employing an SDS process with a differentiable renderer to refine object placements; and (iv) applying depth-conditioned texture inpainting for egocentric texture generation.
Figure 4: Open-Vocabulary Generation Results. Top: Input text prompt. Middle: Bird's-eye view of the scenes. Bottom: Egocentric view of the scenes. AnyHome interprets users' textual inputs and produces structured scenes with realistic textures. It can create a serene and culturally rich environment (Left - "Japanese tea house"), render a more dramatic and stylized ambiance (Middle - "haunted house"), and synthesize unique house types (Right - "cat cafe").
Figure 5: Open-vocabulary Editing Results. Examples showcase AnyHome's capability to modify room types, layouts, object appearances, and overall design through free-form user input. AnyHome also supports comprehensive style alterations and sequential edits, all made possible by its hierarchical structured geometric representation and robust text controllability.
...and 18 more figures
