SAGE: Scalable Agentic 3D Scene Generation for Embodied AI

Hongchi Xia; Xuan Li; Zhaoshuo Li; Qianli Ma; Jiashu Xu; Ming-Yu Liu; Yin Cui; Tsung-Yi Lin; Wei-Chiu Ma; Shenlong Wang; Shuran Song; Fangyin Wei

TL;DR

SAGE addresses the data bottleneck in embodied AI by generating open-domain, physically valid 3D indoor scenes directly from user prompts. It combines an agentic MCP-driven orchestration of scene generators with dual visual and physics critics, enabling self-correcting, simulator-validated scene creation. Through multi-level scene augmentation and automatic action synthesis, SAGE delivers scalable, diverse, and simulation-ready data that improves policy learning and generalization to unseen objects and layouts. The work demonstrates clear scaling benefits in embodied policy performance and provides a large SAGE-10k dataset to accelerate research, with extensions to multi-room layouts, image-conditioned generation, and articulated objects.

Abstract

Real-world data collection for embodied agents remains costly and unsafe, calling for scalable, realistic, and simulator-ready 3D environments. However, existing scene-generation systems often rely on rule-based or task-specific pipelines, yielding artifacts and physically invalid scenes. We present SAGE, an agentic framework that, given a user-specified embodied task (e.g., "pick up a bowl and place it on the table"), understands the intent and automatically generates simulation-ready environments at scale. The agent couples multiple generators for layout and object composition with critics that evaluate semantic plausibility, visual realism, and physical stability. Through iterative reasoning and adaptive tool selection, it self-refines the scenes until they satisfy both user intent and physical validity. The resulting environments are realistic, diverse, and directly deployable in modern simulators for policy training. Policies trained purely on this data exhibit clear scaling trends and generalize to unseen objects and layouts, demonstrating the promise of simulation-driven scaling for embodied AI. Code, demos, and the SAGE-10k dataset are available on the project page: https://nvlabs.github.io/sage.
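The generate-and-critique loop described in the abstract can be illustrated with a small toy sketch. This is not the SAGE implementation: the `Scene` class, the two critics, the `apply` function, and the rule that an object with negative height penetrates the floor are all hypothetical stand-ins chosen for illustration. In the paper, critics rely on rendered views and physics simulation rather than the simple checks used here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Position = Tuple[float, float, float]
Suggestion = Tuple[str, str]  # (action, object_name), hypothetical format


@dataclass
class Scene:
    # Maps object name to (x, y, z); z is height relative to the floor.
    objects: Dict[str, Position] = field(default_factory=dict)


Critic = Callable[[Scene], List[Suggestion]]


def make_visual_critic(required: List[str]) -> Critic:
    """Toy stand-in for a visual critic: flags required objects that are missing."""
    def critic(scene: Scene) -> List[Suggestion]:
        return [("place", name) for name in required if name not in scene.objects]
    return critic


def physics_critic(scene: Scene) -> List[Suggestion]:
    """Toy stand-in for a physics critic: flags objects that sink below the floor."""
    return [("move", name) for name, (_, _, z) in scene.objects.items() if z < 0.0]


def apply(scene: Scene, suggestion: Suggestion) -> None:
    """Toy generator tools: place at the origin, snap to the floor, or remove."""
    action, name = suggestion
    if action == "place":
        scene.objects[name] = (0.0, 0.0, 0.0)
    elif action == "move":
        x, y, _ = scene.objects[name]
        scene.objects[name] = (x, y, 0.0)
    elif action == "remove":
        scene.objects.pop(name, None)


def refine(scene: Scene, critics: List[Critic], max_rounds: int = 10) -> int:
    """Apply critic suggestions until none remain; return the number of edit rounds."""
    for round_idx in range(max_rounds):
        suggestions = [s for critic in critics for s in critic(scene)]
        if not suggestions:
            return round_idx  # every critic is satisfied
        for suggestion in suggestions:
            apply(scene, suggestion)
    return max_rounds  # round budget exhausted; the scene may still have issues


if __name__ == "__main__":
    scene = Scene(objects={"bed": (1.0, 2.0, -0.1)})
    critics = [make_visual_critic(["bed", "table"]), physics_critic]
    print(refine(scene, critics), scene.objects)
```

The loop stops either when no critic raises a suggestion or when the round budget runs out, which mirrors the idea of refining until the scene is judged acceptable while still guaranteeing termination.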

Paper Structure (79 sections, 11 figures, 16 tables)

Introduction
Related Works
3D Indoor Scene Synthesis
Simulation Environment for Embodied AI
Method
Agent-driven Scene Generation
Generator
Scene Initializer
Asset Placer
Asset Mover
Asset Remover
Critic for Self-Improvement
Visual Critic
Physics Critic
Scaling the Scene for Embodied AI
...and 64 more sections

Figures (11)

Figure 1: Overview and example outputs of SAGE. Given an open-ended user request, our system autonomously composes realistic, diverse, and simulation-ready 3D environments. The generated scenes are directly deployable in modern simulators, supporting embodied tasks such as Mobile Manipulation and Pick-and-Place. Through agent-driven reasoning, generator orchestration, and multi-level augmentation, the framework produces interactive environments at scale for robot policy learning.
Figure 2: Overview of SAGE scene generation. Our system converts open-vocabulary text prompts into simulation-ready 3D scenes by orchestrating multiple generator tools and critics. The agent dynamically calls generators (Scene Init, Asset Placer/Mover/Remover) to construct and refine layouts, while visual and physics critics provide iterative feedback for self-improvement. The visual critic suggests semantic corrections (e.g., missing or misplaced objects), and the physics critic validates stability via Isaac Sim. For example, after the physics critic is applied in the bottom image, the newly added pillows on the bed fall flat. This self-improvement process ends when the agent judges that the generated scene meets the user's requirements. The resulting scenes can be further scaled via augmentation and used for embodied policy learning.
Figure 3: Common and open-vocabulary scene generation comparison. Compared with baselines, SAGE produces more complete scenes with more realistic layouts on common room types, while following the style prompts more faithfully on open-vocabulary queries.
Figure 4: Additional open-vocabulary generation. SAGE produces diverse, semantically coherent scenes spanning various styles and functionalities, from Gym and Office spaces to creative themes like "Cyberpunk game den" and "Starry-night bedroom".
Figure 5: Stability verification. Generated scenes are loaded into Isaac Sim for physical validation. Both baselines exhibit displaced objects due to instability, whereas SAGE preserves scene stability before and after simulation.
...and 6 more figures
