MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

Lirong Che; Shuo Wen; Shan Huang; Chuang Wang; Yuzhe Yang; Gregory Dudek; Xueqian Wang; Jian Su

Abstract

Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.

Paper Structure (40 sections, 5 equations, 25 figures, 6 tables, 4 algorithms)

Introduction
Related Work
MANSION
MANSION Framework
MANSION Ecosystem
Experiments
Floorplan Generation Algorithm
Object Placement Evaluation
Embodied algorithms in MANSION
Conclusion
Additional Qualitative Results
Qualitative Floorplan Comparison
Qualitative comparison with Holodeck
Structural Flexibility and Physical Fidelity
MansionWorld Dataset Details
...and 25 more sections

Figures (25)

Figure 1: MansionWorld: The first building-scale dataset for long-horizon embodied AI tasks. Generated by our MANSION framework, this dataset represents the first large-scale collection of multi-story, customizable themed environments. The visualization highlights four representative examples: Kindergarten, Hospital, Supermarket, and a Six-story Office Building, which feature complex functional zoning and fully navigable vertical connections to support long-horizon, cross-floor embodied AI tasks. You can access the MansionWorld dataset at: https://huggingface.co/datasets/superbigsaw/MansionWorld
Figure 2: Overview of the MANSION framework: a multi-agent-driven pipeline for generating multi-story 3D buildings from natural language. The process includes: (A) Whole Building Planning, (B) Per-Floor Planning, (C) Floorplan Synthesis, and (D) Scene Instantiation.
Figure 3: MansionWorld statistics: functional composition and floor-area distributions across different floor counts.
Figure 4: The "Check-and-Provision" workflow of our Task-Semantic Scene Editing Agent. The agent first decomposes a high-level instruction ("bring a snack and a drink to the sofa") into preconditions. It then sequentially performs (a) a Path Connectivity Check, (b) an Object Availability Check, and (c) Object Provisioning & Scene Edit to ensure the task is executable before generation.
Figure 5: Qualitative comparison of object placement.
...and 20 more figures
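
The "Check-and-Provision" workflow described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the room-graph representation, the function names `reachable` and `check_and_provision`, the `default_room` parameter, and the toy two-floor scene are all assumptions made for illustration. The sketch models a building as a room-adjacency graph in which stairs link floors, and it simply reports failure when the path check does not pass.

```python
from collections import deque


def reachable(adjacency, start, goal):
    """Breadth-first search over a (hypothetical) room-connectivity graph."""
    seen, queue = {start}, deque([start])
    while queue:
        room = queue.popleft()
        if room == goal:
            return True
        for nxt in adjacency.get(room, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def check_and_provision(adjacency, objects, start, goal, required, default_room):
    """Return (executable, edits); `objects` (room -> set of names) is updated in place."""
    # (a) Path connectivity check: the goal room must be reachable from the start.
    if not reachable(adjacency, start, goal):
        return False, []
    edits = []
    for name in required:
        # (b) Object availability check: the object must sit in a reachable room.
        available = any(
            name in objs and reachable(adjacency, start, room)
            for room, objs in objects.items()
        )
        if not available:
            # (c) Object provisioning: add the missing object (assumed placement rule).
            objects.setdefault(default_room, set()).add(name)
            edits.append((name, default_room))
    return True, edits


# Toy two-floor scene (illustrative data only): stairs connect the two floors.
adjacency = {
    "f1_living": ["f1_stairs"],
    "f1_stairs": ["f1_living", "f2_stairs"],
    "f2_stairs": ["f1_stairs", "f2_kitchen"],
    "f2_kitchen": ["f2_stairs"],
}
objects = {"f1_living": {"sofa"}, "f2_kitchen": {"drink"}}

executable, edits = check_and_provision(
    adjacency, objects, "f1_living", "f2_kitchen", ["snack", "drink"], "f2_kitchen"
)
```

In this toy scene the drink already exists in a reachable room, so only the snack is provisioned. A real system would also need to choose a physically valid placement for each added object, which this sketch does not attempt.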
