CTRL-D: Controllable Dynamic 3D Scene Editing with Personalized 2D Diffusion

Kai He, Chin-Hsuan Wu, Igor Gilitschenski

TL;DR

Ctrl-D introduces a practical pipeline for controllable dynamic 3D scene editing by turning the challenge into a two-step process: personalize a 2D diffusion editor (InstructPix2Pix) with a single edited reference image, and then perform a two-stage optimization on deformable 3D Gaussians to propagate edits over time and views. The first stage densifies Gaussians in canonical space using a keyframe, while the second stage jointly optimizes the deformation field and Gaussians with an edited image buffer and a temporal loss to ensure temporal and multi-view consistency. Key contributions include (1) IP2P personalization from a single pair with prior preservation, (2) a two-stage dynamic 3D Gaussian optimization that leverages an edited image buffer for efficiency, and (3) demonstrated improvements in local edit precision, temporal coherence, and runtime over state-of-the-art baselines. The approach enables flexible, user-driven local edits in dynamic scenes and supports both monocular and multi-camera scenarios, with strong generalization to new editing domains and clear pathways for future improvements.
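The two-stage schedule and the edited image buffer described above can be illustrated with a small sketch. This is only one plausible reading of the summary, not the authors' implementation: the names `EditedImageBuffer`, `trainable_groups`, and `refresh_every`, the refresh policy, and the idea that the deformation field stays frozen in stage one are all assumptions made for illustration. The point it shows is how caching edited images per (camera, time) pair avoids calling the 2D editor at every optimization step.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Image = List[float]        # stand-in for an image tensor (assumption)
ViewKey = Tuple[int, int]  # (camera index, time index), hypothetical key layout


@dataclass
class EditedImageBuffer:
    """Caches edited images so the 2D editor is not re-run on every step."""
    edit_fn: Callable[[Image], Image]   # e.g. the personalized 2D editor
    refresh_every: int = 50             # hypothetical refresh interval
    editor_calls: int = 0
    _store: Dict[ViewKey, Tuple[Image, int]] = field(default_factory=dict)

    def get(self, key: ViewKey, render: Image, step: int) -> Image:
        """Return a cached edit, or re-edit the current render if it is stale."""
        cached = self._store.get(key)
        if cached is not None and step - cached[1] < self.refresh_every:
            return cached[0]
        edited = self.edit_fn(render)
        self.editor_calls += 1
        self._store[key] = (edited, step)
        return edited


def trainable_groups(step: int, stage1_steps: int) -> Tuple[str, ...]:
    """Stage 1 updates only the Gaussians; stage 2 also updates the deformation field."""
    if step < stage1_steps:
        return ("gaussians",)
    return ("gaussians", "deformation_field")
```

With a toy editor that adds 1 to every pixel and `refresh_every=3`, requesting the same view at steps 0, 1, and 3 calls the editor only twice: the request at step 1 is served from the buffer.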

Abstract

Recent advances in 3D representations, such as Neural Radiance Fields and 3D Gaussian Splatting, have greatly improved realistic scene modeling and novel-view synthesis. However, achieving controllable and consistent editing in dynamic 3D scenes remains a significant challenge. Previous work is largely constrained by its editing backbones, resulting in inconsistent edits and limited controllability. In our work, we introduce a novel framework that first fine-tunes the InstructPix2Pix model, followed by a two-stage optimization of the scene based on deformable 3D Gaussians. Our fine-tuning enables the model to "learn" the editing ability from a single edited reference image, transforming the complex task of dynamic scene editing into a simple 2D image editing process. By directly learning editing regions and styles from the reference, our approach enables consistent and precise local edits without the need for tracking desired editing regions, effectively addressing key challenges in dynamic scene editing. Then, our two-stage optimization progressively edits the trained dynamic scene, using a designed edited image buffer to accelerate convergence and improve temporal consistency. Compared to state-of-the-art methods, our approach offers more flexible and controllable local scene editing, achieving high-quality and consistent results.
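To make the single-reference fine-tuning more concrete, the sketch below combines a loss on the edited reference pair with a prior-preservation term on samples from the original model. It assumes a DreamBooth-style weighted sum of noise-prediction errors; the abstract does not give the exact objective, so the function names, the `prior_weight` parameter, and the use of plain mean squared error are illustrative assumptions rather than the paper's formulation.

```python
from typing import List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def personalization_loss(
    ref_pred: Sequence[float],
    ref_noise: Sequence[float],
    prior_preds: List[Sequence[float]],
    prior_noises: List[Sequence[float]],
    prior_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> float:
    """Reference-pair loss plus a weighted prior-preservation term (assumed form)."""
    ref_term = mse(ref_pred, ref_noise)
    if not prior_preds:
        return ref_term
    # Average the error over samples generated by the original (frozen) model.
    prior_term = sum(
        mse(p, n) for p, n in zip(prior_preds, prior_noises)
    ) / len(prior_preds)
    return ref_term + prior_weight * prior_term
```

The prior term is what keeps the editor from overfitting to the single reference: setting `prior_weight` to zero reduces the objective to fitting the reference pair alone.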

Paper Structure

This paper contains 29 sections, 6 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: We present Ctrl-D, a dynamic 3D scene editing framework that enables controllable, high-quality, consistent scene edits by editing only a single image using any 2D editing approach. Our framework is also compatible with both monocular and multi-camera scenes. Please refer to our project page for dynamic visualizations.
  • Figure 2: Our pipeline for controllable dynamic scene editing. Given a dynamic 3D scene, our method (a) first edits one frame as a reference with any 2D editing model, (b) then fine-tunes InstructPix2Pix with the edited reference image, along with images sampled from the original model to preserve the model priors, and (c) finally optimizes the dynamic 3D scene with a deformable Gaussian representation, using the designed two-stage method.
  • Figure 3: Qualitative results on both monocular and multi-camera scenes. For each scene, we show two edited versions based on the original, using various 2D editing techniques to demonstrate the high fidelity, quality, and controllability of our method. The reference 2D image for each edit appears in the bottom-right or bottom-left corner.
  • Figure 4: Qualitative comparison with Instruct 4D-to-4D (IN4D) on text-driven scene editing. The reference 2D images used in our method are shown at the bottom-left of each edited scene. We also provide zoomed-in details of the edited scene on the right side. Our method demonstrates superior consistency, higher quality, and more precise local edits, which are not achievable with IN4D.
  • Figure 5: Qualitative comparison with AnyV2V on monocular scenes. The leftmost image is the edited first frame. The prompt for our fine-tuned IP2P is "Put him in a $\texttt{<V>}$ suit". Our results demonstrate higher quality and greater consistency.
  • ...and 3 more figures