Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing

Jiyuan Wang; Chunyu Lin; Lei Sun; Zhi Cao; Yuyang Yin; Lang Nie; Zhenlong Yuan; Xiangxiang Chu; Yunchao Wei; Kang Liao; Guosheng Lin

TL;DR

This paper proposes RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model, VGGT, that achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency.

Abstract

Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent paired editing data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, which naturally positions reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data: we feed it the edited images and use its output confidence maps and pose estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
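
As a rough illustration of how the two signals named in the abstract could become an RL training signal, the sketch below combines a mean confidence score and a mean pose-estimation error into one scalar reward, then normalizes rewards within a group of sampled edits. The function names, the weights `w_conf` and `w_pose`, the linear combination, and the group-relative normalization are all assumptions made for illustration; they are not the paper's reward formulation. Confidence maps and pose errors are passed in as plain numbers, and no actual VGGT call is made.

```python
from statistics import mean, pstdev


def consistency_reward(conf_maps, pose_errors, w_conf=1.0, w_pose=1.0):
    """Hypothetical scalar reward for one set of edited views.

    conf_maps:   one flat list of per-pixel confidence values per view
    pose_errors: one non-negative pose error per view
    """
    avg_conf = mean(mean(view) for view in conf_maps)
    avg_pose_err = mean(pose_errors)
    # Higher confidence raises the reward; larger pose error lowers it.
    return w_conf * avg_conf - w_pose * avg_pose_err


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of sampled edits."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Toy inputs (made-up numbers): two candidate edits of the same 2-view scene.
consistent = consistency_reward([[0.9, 0.8], [0.7, 0.8]], [0.1, 0.1])
inconsistent = consistency_reward([[0.4, 0.3], [0.2, 0.3]], [0.5, 0.7])
advantages = group_relative_advantages([consistent, inconsistent])
```

In this toy setup the more consistent candidate receives a positive advantage and the less consistent one a negative advantage, which is the direction a policy update would need in order to favor 3D-consistent edits.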

Paper Structure (16 sections, 6 equations, 8 figures, 2 tables)

This paper contains 16 sections, 6 equations, 8 figures, 2 tables.

Introduction
Related Work
2D Image Editing Models
3D Editing Models
Reinforcement Learning for 3D Tasks
Methods
3D Editing Pipeline with Reinforcement Learning
Multi-Image Joint Editing
Multi-View Consistent Verification
Overview of Reward Model
Experiments
Implementation Details
Comparison Analysis
Ablation Study
Limitations and Future Work
...and 1 more section

Figures (8)

Figure 1: We propose RL3DEdit, a novel RL-based model for single-pass 3D editing. Our method achieves high-quality results across diverse editing scenarios, including ①motion edits, ②subject replacement, ③style transfer, ④background changes, and ⑤challenging scene addition.
Figure 2: Pipeline of RL3DEdit. The section "3D Editing Pipeline with Reinforcement Learning" details the pipeline.
Figure 3: Comparison of 2D editing capabilities before and after RL fine-tuning. Left: Visual editing results. Right: Quantitative evaluation using VIEScore on GEdit-Bench-EN (higher is better; the metric is detailed in the paper). Both demonstrate that RL3DEdit successfully preserves the original 2D editing fidelity of FLUX-Kontext.
Figure 4: Multi-image joint editing comparison. FLUX-Kontext and Qwen-Image-Edit successfully swap the fur colors, while InstructPix2Pix fails due to the lack of cross-view interaction. Moreover, InstructPix2Pix must resize images to low resolution, causing detail loss in multi-image scenarios.
Figure 5: Empirical analysis of VGGT's depth confidence under progressively degraded 3D consistency. ①-⑤ visualize VGGT confidence predictions for the same set of 9 views, where individual views are gradually replaced by edited versions. ⑥ reveals a near-linear correlation between consistency degradation and average confidence. This validates VGGT as the multi-view consistency verifier. Detailed analysis is in the section "Multi-View Consistent Verification".
...and 3 more figures
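
The near-linear relationship described in the Figure 5 caption is the kind of claim that could be checked with a Pearson correlation between the number of replaced views and the average confidence. The sketch below shows one way to do that. The helper `pearson` is hypothetical, and the input values are placeholders chosen for illustration, not measurements from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Placeholder values only, NOT results from the paper:
# number of views replaced by edited versions vs. average confidence.
replaced_views = [0, 2, 4, 6, 8]
avg_confidence = [5.0, 4.1, 3.2, 2.0, 1.1]

r = pearson(replaced_views, avg_confidence)
```

A coefficient close to -1 on real measurements would support using average confidence as a proxy for how much multi-view consistency has degraded.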
