Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Jiyuan Wang, Chunyu Lin, Lei Sun, Zhi Cao, Yuyang Yin, Lang Nie, Zhenlong Yuan, Xiangxiang Chu, Yunchao Wei, Kang Liao, Guosheng Lin

TL;DR

This paper proposes RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT, that achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency.

Abstract

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent paired editing data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, which naturally positions reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data: we feed it the edited images and use its output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
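To make the reward idea in the abstract concrete, the following is a minimal sketch of how confidence maps and pose errors could be combined into one scalar reward. It is not the paper's formulation: the function names, the weights `w_conf` and `w_pose`, the choice of geodesic rotation angle plus translation distance as the pose error, and the use of reference poses (for example, poses estimated from the unedited views) are all assumptions made for illustration. The arrays stand in for outputs a model such as VGGT would produce; no VGGT API is called here.

```python
import numpy as np

def rotation_angle(R_a, R_b):
    """Geodesic angle in radians between two batches of rotations, shape (N, 3, 3)."""
    # Per-view relative rotation R_a^T @ R_b.
    R_rel = np.einsum("nij,nik->njk", R_a, R_b)
    cos = (np.trace(R_rel, axis1=1, axis2=2) - 1.0) / 2.0
    return np.arccos(np.clip(cos, -1.0, 1.0))

def consistency_reward(conf_maps, R_pred, t_pred, R_ref, t_ref,
                       w_conf=1.0, w_pose=1.0):
    """Hypothetical reward: high mean confidence is good, large pose error is bad.

    conf_maps: (N, H, W) per-view confidence maps for the edited images.
    R_pred, R_ref: (N, 3, 3) predicted and reference rotations (assumed inputs).
    t_pred, t_ref: (N, 3) predicted and reference translations (assumed inputs).
    """
    conf_term = float(np.mean(conf_maps))
    rot_err = rotation_angle(R_pred, R_ref)             # (N,)
    trans_err = np.linalg.norm(t_pred - t_ref, axis=1)  # (N,)
    pose_err = float(np.mean(rot_err + trans_err))
    return w_conf * conf_term - w_pose * pose_err

# Toy usage with placeholder values (not real model outputs).
n_views = 3
conf = np.full((n_views, 8, 8), 2.0)
R = np.tile(np.eye(3), (n_views, 1, 1))
t = np.zeros((n_views, 3))
reward = consistency_reward(conf, R, t, R, t)
print(reward)
```

In this sketch a set of edited views whose estimated poses match the reference poses is scored purely by its mean confidence, and any pose drift lowers the reward. How the two terms are actually weighted or normalized in RL3DEdit is not stated in this summary.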

Paper Structure (16 sections, 6 equations, 8 figures, 2 tables)

Figures (8)

  • Figure 1: We propose RL3DEdit, a novel RL-based model for single-pass 3D editing. Our method achieves high-quality results across diverse editing scenarios, including ①motion edits, ②subject replacement, ③style transfer, ④background changes, and ⑤challenging scene addition.
  • Figure 2: Pipeline of RL3DEdit. Section \ref{sec:grpo_pipeline} details the pipeline.
  • Figure 3: Comparison of 2D editing capabilities before and after RL fine-tuning. Left: Visual editing results. Right: Quantitative evaluation using VIEScore \cite{viescore} on GEdit-Bench-EN \cite{geditbench_en} ($\uparrow$, detailed in Sec. \ref{sec:metric}). Both demonstrate that RL3DEdit successfully preserves the original 2D editing fidelity of FLUX-Kontext.
  • Figure 4: Multi-image joint editing comparison. FLUX-Kontext and Qwen-Image-Edit successfully swap the fur colors, while InstructPix2Pix fails due to the lack of cross-view interaction. Moreover, InstructPix2Pix must resize images to low resolution, causing detail loss in multi-image scenarios.
  • Figure 5: Empirical analysis of VGGT's depth confidence under progressively degraded 3D consistency. ①-⑤ visualize VGGT confidence predictions for the same set of 9 views, where individual views are gradually replaced by edited versions. ⑥ reveals a near-linear correlation between consistency degradation and average confidence. This validates VGGT as the multi-view consistency verifier. Detailed analysis is in Sec. \ref{sec:verifier}.
  • ...and 3 more figures
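
The check described in the Figure 5 caption (average confidence falling as more views are replaced by edited versions) could be reproduced along the following lines. This is a sketch under stated assumptions: `confidence_trend` is a hypothetical helper, the degradation level is taken to be the number of replaced views, and the confidence maps below are synthetic placeholders rather than VGGT outputs or the paper's data.

```python
import numpy as np

def confidence_trend(levels, conf_per_level):
    """Relate degradation level to average confidence.

    levels: sequence of degradation levels (assumed: number of replaced views).
    conf_per_level: one (N, H, W) confidence array per level.
    Returns the per-level mean confidence, the Pearson correlation with
    the levels, and the slope of a least-squares line fit.
    """
    means = np.array([np.mean(c) for c in conf_per_level])
    r = np.corrcoef(levels, means)[0, 1]
    slope, _intercept = np.polyfit(levels, means, deg=1)
    return means, float(r), float(slope)

# Synthetic placeholder maps for 9 views: confidence drops by a fixed
# amount per replaced view, purely to exercise the function.
levels = [0, 1, 2, 3, 4]
maps = [np.full((9, 16, 16), 10.0 - k) for k in levels]
means, r, slope = confidence_trend(levels, maps)
print(means, r, slope)
```

A correlation close to -1 together with a stable negative slope would correspond to the near-linear relationship the caption reports; real confidence maps would be noisier than this toy input.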