MVDrag3D: Drag-based Creative 3D Editing via Multi-view Generation-Reconstruction Priors

Honghua Chen, Yushi Lan, Yongwei Chen, Yifan Zhou, Xingang Pan

TL;DR

MVDrag3D is a novel framework for more flexible and creative drag-based 3D editing. It leverages multi-view generation and reconstruction priors, and adds a multi-view score function that distills generative priors from multiple views to further improve view consistency and visual quality.

Abstract

Drag-based editing has become popular in 2D content creation, driven by the capabilities of image generative models. However, extending this technique to 3D remains a challenge. Existing 3D drag-based editing methods, whether employing explicit spatial transformations or relying on implicit latent optimization within limited-capacity 3D generative models, fall short in handling significant topology changes or generating new textures across diverse object categories. To overcome these limitations, we introduce MVDrag3D, a novel framework for more flexible and creative drag-based 3D editing that leverages multi-view generation and reconstruction priors. At the core of our approach is the use of a multi-view diffusion model as a strong generative prior to perform consistent drag editing over multiple rendered views, followed by a reconstruction model that reconstructs 3D Gaussians of the edited object. While the initial 3D Gaussians may suffer from misalignment between different views, we address this via view-specific deformation networks that adjust the positions of the Gaussians so that they are well aligned. In addition, we propose a multi-view score function that distills generative priors from multiple views to further enhance the view consistency and visual quality. Extensive experiments demonstrate that MVDrag3D provides a precise, generative, and flexible solution for 3D drag-based editing, supporting more versatile editing effects across various object categories and 3D representations.
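
Editing consistently over multiple rendered views requires each view to know where the 3D drag points land in its image; the method overview (Figure 2) describes four orthogonal views, each with projected dragging points. The following is a minimal sketch of that projection step only, not the authors' implementation. It assumes orthographic cameras orbiting the vertical (y) axis at azimuths 0°, 90°, 180° and 270°, and an object normalized to the cube [-1, 1]^3. The function names, camera convention, and image size are all illustrative assumptions.

```python
import numpy as np

def view_rotation(azimuth_deg: float) -> np.ndarray:
    """R_y(-azimuth): expresses world points in the frame of a camera
    that has been rotated by +azimuth about the y axis (assumed convention)."""
    a = np.deg2rad(azimuth_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, -s],
                     [0.0, 1.0, 0.0],
                     [s, 0.0, c]])

def project_points(points: np.ndarray, azimuth_deg: float,
                   image_size: int = 256) -> np.ndarray:
    """Orthographically project (N, 3) points in [-1, 1]^3 to (N, 2) pixels."""
    cam = points @ view_rotation(azimuth_deg).T
    # Map camera x in [-1, 1] to pixel columns [0, image_size - 1].
    x = (cam[:, 0] + 1.0) * 0.5 * (image_size - 1)
    # Image rows grow downward, so flip the camera y axis.
    y = (1.0 - (cam[:, 1] + 1.0) * 0.5) * (image_size - 1)
    return np.stack([x, y], axis=1)

def project_drag_pairs(sources, targets,
                       azimuths=(0.0, 90.0, 180.0, 270.0),
                       image_size: int = 256) -> dict:
    """Return {azimuth: (source_pixels, target_pixels)} for every view."""
    sources = np.asarray(sources, dtype=float)
    targets = np.asarray(targets, dtype=float)
    return {
        az: (project_points(sources, az, image_size),
             project_points(targets, az, image_size))
        for az in azimuths
    }

# One drag pair: pull the object's center toward +x.
per_view = project_drag_pairs([[0.0, 0.0, 0.0]], [[1.0, 0.0, 0.0]])
```

Under these assumptions, a drag along +x appears as a horizontal shift in the 0° and 180° views and as almost no 2D motion in the 90° and 270° views, since it then runs along the viewing direction. This illustrates why a single view cannot constrain every drag on its own.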

Paper Structure

This paper contains 20 sections, 5 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Comparison of our MVDrag3D with state-of-the-art approaches. The first two rows present results of dragging on meshes, while the last two focus on 3D Gaussians. Notably, APAP (Yoo et al., 2024) is specifically designed for mesh structures, so it was not tested on 3D Gaussians. Overall, our method produces more plausible and generative editing results, showing better performance across both 3D Gaussians and meshes.
  • Figure 2: Method overview. Given a 3D model and multiple pairs of 3D dragging points, we first render the model into four orthogonal views, each with corresponding projected dragging points. Then, to ensure consistent dragging across these views, we define a multi-view guidance energy within a multi-view diffusion model. The resulting dragged images are used to regress an initial set of 3D Gaussians. Our method further employs a two-stage optimization process: first, a deformation network adjusts the positions of the Gaussians for improved geometric alignment; then, image-conditioned multi-view score distillation enhances the visual quality of the final output.
  • Figure 3: Effect of DDIM inversion with random noise. When the four rendered images are inverted with MVDream, the resulting noise deviates from a Gaussian distribution (b). Adding random noise ($\mathcal{N}(0, 0.01)$) to the background pixels helps the latent variables conform more closely to a Gaussian distribution (c). The resulting multi-view edits are shown in (d) and (e). Yellow arrows indicate the views with evident identity changes.
  • Figure 4: Effect of Gaussian position optimization. (c) shows that the initial 3D reconstruction may exhibit structural misalignment. By employing a deformation network to optimize the Gaussian positions, we achieve better compactness and consistency among the Gaussians across different views, as shown in (d).
  • Figure 5: Effect of image-conditioned multi-view SDS. (c) presents the reconstruction results without appearance optimization, while (d) displays the corresponding results after optimization, which are noticeably sharper and clearer.
  • ...and 5 more figures
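
The background-noise step described in the Figure 3 caption can be sketched as follows. This is an illustrative sketch rather than the authors' code. It assumes images are float arrays in [0, 1], that a boolean foreground mask is available from the renderer, and that the 0.01 in $\mathcal{N}(0, 0.01)$ is a variance (standard deviation 0.1). The caption does not state which convention is meant. The function name and the final clipping are likewise assumptions.

```python
import numpy as np

def add_background_noise(image, foreground_mask, variance=0.01, seed=None):
    """Add zero-mean Gaussian noise to background pixels only.

    image:           (H, W, 3) float array with values in [0, 1].
    foreground_mask: (H, W) boolean array, True where the object is rendered.
    variance:        noise variance (assumed reading of N(0, 0.01)).
    """
    image = np.asarray(image, dtype=float)
    mask = np.asarray(foreground_mask, dtype=bool)
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, np.sqrt(variance), size=image.shape)
    # Broadcast the (H, W) mask over the channel axis; the object stays untouched.
    background = ~mask[..., None]
    noisy = np.where(background, image + noise, image)
    # Keep the result a valid image (assumption: the inversion expects [0, 1]).
    return np.clip(noisy, 0.0, 1.0)

# Toy example: mid-gray image with a square "object" in the middle.
img = np.full((8, 8, 3), 0.5)
fg = np.zeros((8, 8), dtype=bool)
fg[2:6, 2:6] = True
noisy_img = add_background_noise(img, fg, seed=0)
```

Note that with a pure white or pure black background, clipping discards roughly half of the added noise, so the clipping step may need to be dropped or the background value shifted, depending on what the inversion pipeline expects.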