SceneReVis: A Self-Reflective Vision-Grounded Framework for 3D Indoor Scene Synthesis via Multi-turn RL

Yang Zhao, Shizhao Sun, Meisheng Zhang, Yingdong Shi, Xubo Yang, Jiang Bian

TL;DR

SceneReVis is a vision-grounded, self-reflective framework for 3D indoor scene synthesis that uses a diagnose-and-act loop to intercept spatial conflicts during generation. It reframes scene construction as a multi-turn partially observable Markov decision process (POMDP) and pairs a reverse-engineered dataset (SceneChain-12k) with a two-stage training curriculum (SFT followed by agentic RL with GRPO) that turns the model into an active spatial planner. The authors report state-of-the-art results on standard, long-tail, and goal-oriented generation tasks, which they attribute to a hybrid reward structure that couples dense feedback with a final quality assessment. The combination of visual grounding, iterative reasoning, and long-horizon planning is credited with robust generalization and practical utility, including support for irregular floor plans.
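To make the hybrid reward and the GRPO step more concrete, here is a minimal sketch. It assumes that a trajectory's return is the sum of its dense per-step rewards plus a weighted final quality score, and that advantages are obtained by standardizing returns within a group of rollouts for the same prompt, as in standard GRPO. The function names, the `final_weight` parameter, and the reward values are illustrative assumptions, not the paper's actual reward definition.

```python
from statistics import mean, pstdev

def trajectory_return(step_rewards, final_reward, final_weight=1.0):
    """Hybrid return: dense per-step rewards plus a weighted terminal score.

    final_weight is a hypothetical knob; the summary does not give the real weighting.
    """
    return sum(step_rewards) + final_weight * final_reward

def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style advantages: standardize each return against its own group."""
    mu = mean(returns)
    sigma = pstdev(returns)  # population std over the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in returns]

# Three hypothetical rollouts for one prompt: (dense step rewards, final score).
group = [
    ([0.1, 0.2, 0.1], 1.0),
    ([0.0, -0.1, 0.0], 0.2),
    ([0.2, 0.2, 0.2], 0.6),
]
returns = [trajectory_return(steps, final) for steps, final in group]
advantages = group_relative_advantages(returns)
```

Because the advantages are centered within the group, rollouts that beat their siblings are reinforced and the rest are penalized, without a separate learned value function.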

Abstract

Current one-pass 3D scene synthesis methods often suffer from spatial hallucinations, such as collisions, due to a lack of deliberative reasoning. To bridge this gap, we introduce SceneReVis, a vision-grounded self-reflection framework that employs an iterative "diagnose-and-act" loop to explicitly intercept and resolve spatial conflicts using multi-modal feedback. To support this step-wise paradigm, we construct SceneChain-12k, a large-scale dataset of causal construction trajectories derived through a novel reverse engineering pipeline. We further propose a two-stage training recipe that transitions from Supervised Fine-Tuning to Agentic Reinforcement Learning, evolving the model into an active spatial planner. Extensive experiments demonstrate that SceneReVis achieves state-of-the-art performance in high-fidelity generation and goal-oriented optimization, with robust generalization to long-tail domains.
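The control flow of a diagnose-and-act loop can be illustrated with a toy example. The sketch below is not the SceneReVis agent: it replaces the learned model and its multi-modal feedback with a purely geometric check on 2D axis-aligned footprints, and resolves a conflict with a fixed rule (slide the second object past the first along x). The names `Box`, `diagnose`, `act`, and `diagnose_and_act` are hypothetical, and room boundaries are ignored.

```python
from dataclasses import dataclass

@dataclass
class Box:
    """Axis-aligned floor footprint: (x, y) is the min corner, w and d the extents."""
    name: str
    x: float
    y: float
    w: float
    d: float

def overlaps(a, b):
    """True if the interiors of two footprints intersect (touching edges are fine)."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.d and b.y < a.y + a.d)

def diagnose(scene):
    """Return every pair of objects whose footprints collide."""
    return [(a, b) for i, a in enumerate(scene)
            for b in scene[i + 1:] if overlaps(a, b)]

def act(conflict):
    """Toy repair rule: slide the second object just past the first along x."""
    a, b = conflict
    b.x = a.x + a.w

def diagnose_and_act(scene, max_turns=10):
    """Alternate diagnosis and repair; return True if the scene ends conflict-free."""
    for _ in range(max_turns):
        conflicts = diagnose(scene)
        if not conflicts:
            return True
        act(conflicts[0])
    return not diagnose(scene)

bed = Box("bed", 0.0, 0.0, 2.0, 2.0)
nightstand = Box("nightstand", 1.5, 0.5, 0.5, 0.5)  # starts inside the bed
wardrobe = Box("wardrobe", 3.0, 0.0, 1.0, 0.6)
scene = [bed, nightstand, wardrobe]
resolved = diagnose_and_act(scene)
```

In this example the first diagnosis finds the bed and nightstand colliding, one action moves the nightstand to the bed's right edge, and the second diagnosis finds no conflicts. The turn cap matters because a naive repair rule can introduce new collisions elsewhere.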

Paper Structure (63 sections, 6 equations, 6 figures, 6 tables, 2 algorithms)

Figures (6)

  • Figure 1: Comparison of generation paradigms. (a) One-Pass Generation lacks intermediate reasoning, leading to severe physical violations. (b) Post-Processing Generation tends to be trapped in local optima, yielding suboptimal visual and semantic quality. (c) SceneReVis (Ours) employs a self-reflection paradigm to ensure physical plausibility and aesthetic coherence.
  • Figure 2: Overview of the SceneReVis learning framework. The pipeline consists of three stages, shown in panels (a), (b), and (d). (a) Data Construction: We employ a Reverse Engineering strategy to decompose static 3D scenes into causal construction trajectories (SceneChain-12k). (b) Cold Start: A Supervised Fine-Tuning (SFT) stage initializes the agent with basic tool usage capabilities. (c) Legend: Explains the contents of the other panels. (d) Agentic RL: The core stage utilizes Group Relative Policy Optimization (GRPO). The agent interacts with a physics-enabled simulator via a "diagnose-and-act" loop, receiving multi-modal feedback and optimizing against non-differentiable objectives.
  • Figure 3: Qualitative comparison of standard scenes. We visualize the 3D scenes generated by different methods for Bedroom (top) and Living Room (bottom) scenarios based on the given text prompts.
  • Figure 5: Qualitative results for goal-oriented scene optimization. We visualize optimization trajectories under three conditions—Chaotic & Missing, Chaotic Only, and Missing Only.
  • Figure 6: Additional Qualitative Comparisons.
  • ...and 1 more figure