MANSION: Multi-floor lANguage-to-3D Scene generatIOn for loNg-horizon tasks

Lirong Che, Shuo Wen, Shan Huang, Chuang Wang, Yuzhe Yang, Gregory Dudek, Xueqian Wang, Jian Su

Abstract

Real-world robotic tasks are long-horizon and often span multiple floors, demanding rich spatial reasoning. However, existing embodied benchmarks are largely confined to single-floor in-house environments, failing to reflect the complexity of real-world tasks. We introduce MANSION, the first language-driven framework for generating building-scale, multi-floor 3D environments. Being aware of vertical structural constraints, MANSION generates realistic, navigable whole-building structures with diverse, human-friendly scenes, enabling the development and evaluation of cross-floor long-horizon tasks. Building on this framework, we release MansionWorld, a dataset of over 1,000 diverse buildings ranging from hospitals to offices, alongside a Task-Semantic Scene Editing Agent that customizes these environments using open-vocabulary commands to meet specific user needs. Benchmarking reveals that state-of-the-art agents degrade sharply in our settings, establishing MANSION as a critical testbed for the next generation of spatial reasoning and planning.

Paper Structure (40 sections, 5 equations, 25 figures, 6 tables, 4 algorithms)

Figures (25)

  • Figure 1: MansionWorld: The first building-scale dataset for long-horizon embodied AI tasks. Generated by our MANSION framework, this dataset represents the first large-scale collection of multi-story, customizable themed environments. The visualization highlights four representative examples: Kindergarten, Hospital, Supermarket, and a Six-story Office Building, which feature complex functional zoning and fully navigable vertical connections to support long-horizon, cross-floor embodied AI tasks. You can access the MansionWorld dataset at: https://huggingface.co/datasets/superbigsaw/MansionWorld
  • Figure 2: Overview of the MANSION framework: a multi-agent-driven pipeline for generating multi-story 3D buildings from natural language. The process includes: (A) Whole Building Planning, (B) Per-Floor Planning, (C) Floorplan Synthesis, and (D) Scene Instantiation.
  • Figure 3: MansionWorld statistics: functional composition and floor-area distributions across different floor counts.
  • Figure 4: The "Check-and-Provision" workflow of our Task-Semantic Scene Editing Agent. The agent first decomposes a high-level instruction ("bring a snack and a drink to the sofa") into preconditions. It then sequentially performs (a) a Path Connectivity Check, (b) an Object Availability Check, and (c) Object Provisioning & Scene Edit to ensure the task is executable before generation.
  • Figure 5: Qualitative comparison of object placement.
  • ...and 20 more figures
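
The "Check-and-Provision" workflow described in the Figure 4 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the scene representation (a dictionary with a room adjacency graph and per-room object lists), the function names `reachable_rooms` and `check_and_provision`, and the choice of breadth-first search for the connectivity check are all assumptions made purely for illustration.

```python
from collections import deque

def reachable_rooms(adjacency, start):
    """Return every room reachable from `start` via breadth-first search.

    `adjacency` is a hypothetical room graph; stairs or elevators are
    modelled as ordinary nodes that link rooms on different floors.
    """
    seen, queue = {start}, deque([start])
    while queue:
        room = queue.popleft()
        for neighbour in adjacency.get(room, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen

def check_and_provision(scene, agent_room, target_room, required_objects, spawn_room):
    """Hypothetical three-step check; returns the objects that were added."""
    rooms = reachable_rooms(scene["adjacency"], agent_room)

    # (a) Path connectivity check: the target must be reachable at all.
    if target_room not in rooms:
        raise ValueError(f"no path from {agent_room} to {target_room}")

    # (b) Object availability check: only objects in reachable rooms count.
    available = {obj for room in rooms for obj in scene["objects"].get(room, [])}
    missing = [obj for obj in required_objects if obj not in available]

    # (c) Object provisioning and scene edit: add what is missing.
    for obj in missing:
        scene["objects"].setdefault(spawn_room, []).append(obj)
    return missing

# Toy two-floor scene (assumed data, not taken from MansionWorld).
scene = {
    "adjacency": {
        "kitchen_f1": ["stairs"],
        "stairs": ["kitchen_f1", "living_f2"],
        "living_f2": ["stairs"],
    },
    "objects": {"kitchen_f1": ["soda"], "living_f2": ["sofa"]},
}

missing = check_and_provision(
    scene,
    agent_room="kitchen_f1",
    target_room="living_f2",
    required_objects=["snack", "soda"],
    spawn_room="kitchen_f1",
)
print(missing)
```

In this toy scene the sofa is reachable through the stairs and the drink already exists, so only the snack is provisioned. A real editing agent would presumably operate on full 3D scene data and open-vocabulary object names rather than string labels.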