3DEgo: 3D Editing on the Go!

Umar Khalid, Hasan Iqbal, Azib Farooq, Jing Hua, Chen Chen

TL;DR

3DEgo tackles the problem of converting monocular videos into photorealistic, text-guided 3D scenes without COLMAP pose estimation or initial unedited models. It introduces a COLMAP-free, single-stage pipeline that first performs autoregressive, multi-view-consistent 2D editing with a diffusion model and a noise blender, then reconstructs the scene using 3D Gaussian Splatting with Gaussians $h=\{\mu, \Sigma, c, \alpha, m\}$ and a KEA identity vector $m$ guided by losses $L_{rgb}, L_{KEA}, L_{ipc}, L_{pc}$. A two-stage training process (relative pose initialization, then global 3D scene expansion with progressive densification) enables accurate pose estimation and coherent 3D growth, validated by extensive experiments on six datasets including GS25. The results show fast, precise, and adaptable editing across diverse video sources, highlighting 3DEgo’s potential to democratize 3D content creation from casual footage, while acknowledging current diffusion-model limitations in edge-case edits.

Abstract

We introduce 3DEgo to address a novel problem of directly synthesizing photorealistic 3D scenes from monocular videos guided by textual prompts. Conventional methods construct a text-conditioned 3D scene through a three-stage process, involving pose estimation using Structure-from-Motion (SfM) libraries like COLMAP, initializing the 3D model with unedited images, and iteratively updating the dataset with edited images to achieve a 3D scene with text fidelity. Our framework streamlines the conventional multi-stage 3D editing process into a single-stage workflow by overcoming the reliance on COLMAP and eliminating the cost of model initialization. We apply a diffusion model to edit video frames prior to 3D scene creation by incorporating our designed noise blender module for enhancing multi-view editing consistency, a step that does not require additional training or fine-tuning of T2I diffusion models. 3DEgo utilizes 3D Gaussian Splatting to create 3D scenes from the multi-view consistent edited frames, capitalizing on the inherent temporal continuity and explicit point cloud data. 3DEgo demonstrates remarkable editing precision, speed, and adaptability across a variety of video sources, as validated by extensive evaluations on six datasets, including our own prepared GS25 dataset. Project Page: https://3dego.github.io/

Paper Structure (18 sections, 13 equations, 7 figures, 4 tables)

This paper contains 18 sections, 13 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Our method, 3DEgo, streamlines the 3D editing process by merging a three-stage workflow into a single, comprehensive framework. This efficiency is achieved by bypassing the need for COLMAP [schonberger2016structure] for pose initialization and avoiding the initialization of the model with unedited images, unlike existing approaches [haque2023instruct, dong2024vica, kim2023collaborative].
  • Figure 2: 3DEgo offers rapid, accurate, and adaptable 3D editing, bypassing the need for original 3D scene initialization and COLMAP poses. This ensures compatibility with videos from any source, including casual smartphone captures like the Van 360-degree scene. The above results identify three cases challenging for IN2N [haque2023instruct], where our method can convert a monocular video into customized 3D scenes using a streamlined, single-stage reconstruction process.
  • Figure 3: Autoregressive Editing. At each denoising step, the model predicts $w+1$ separate noises, which are then unified via a weighted noise blender (Eq. \ref{noise_blender}) to predict ${\varepsilon}_\theta (e_t, f, \mathcal{T}, W)$.
  • Figure 4: Qualitative comparison of our method with IN2N [haque2023instruct] over two separate scenes. When the editing prompt requests "Give the wheels Blue Color and Make the recyclebins brown," IN2N [haque2023instruct] also turns the entire van blue instead of changing only the tire color. Note that IN2N [haque2023instruct] uses poses from COLMAP, while 3DEgo estimates poses while constructing the 3D scene.
  • Figure 5: Our approach surpasses Gaussian Grouping [ye2023gaussian] in 3D object elimination across different scenes from the GS25 and Tanks and Temples datasets. 3DEgo is capable of eliminating substantial objects like statues from the entire scene while significantly minimizing artifacts and avoiding a blurred background.
  • ...and 2 more figures
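
The Figure 3 caption describes unifying $w+1$ noise predictions through a weighted noise blender. The exact blending equation and its weights are not reproduced on this page, so the following is only a minimal sketch of one plausible reading: a normalized weighted average. The function name `blend_noises`, the flat-list noise representation, and the example weights are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence


def blend_noises(noises: Sequence[Sequence[float]],
                 weights: Sequence[float]) -> List[float]:
    """Hypothetical helper: weighted average of several noise predictions."""
    if not noises or len(noises) != len(weights):
        raise ValueError("need exactly one weight per noise prediction")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    size = len(noises[0])
    if any(len(n) != size for n in noises):
        raise ValueError("all noise predictions must have the same length")
    # Normalize the weights so the blend stays on the scale of the inputs.
    norm = [w / total for w in weights]
    # Element-wise weighted sum across the predictions.
    return [sum(wk * n[i] for wk, n in zip(norm, noises)) for i in range(size)]


# Assumed toy setup: w = 2, so the model yields w + 1 = 3 noise predictions.
predictions = [[0.0, 2.0], [1.0, 2.0], [2.0, 2.0]]
blended = blend_noises(predictions, weights=[1.0, 1.0, 2.0])
print(blended)  # [1.25, 2.0]
```

In a real diffusion pipeline the predictions would be latent tensors rather than flat lists, and the blended result would stand in for the single noise estimate used at that denoising step. Normalizing the weights is a design choice in this sketch, made so that the blended estimate keeps the same scale as each individual prediction.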