Enhancing Temporal Consistency in Video Editing by Reconstructing Videos with 3D Gaussian Splatting

Inkyu Shin, Qihang Yu, Xiaohui Shen, In So Kweon, Kuk-Jin Yoon, Liang-Chieh Chen

TL;DR

The paper tackles temporal inconsistency in zero-shot video editing by introducing Video-3DGS, a two-stage framework that reconstructs dynamic monocular videos with 3D Gaussian Splatting (3DGS) and then refines edits to enhance temporal coherence. Stage 1 uses MC-COLMAP to produce clip-level foreground and background geometry and builds Frg-3DGS and Bkg-3DGS, merged via a 2D learnable alpha map to reconstruct frames. Stage 2 leverages the reconstructed scene as a plug‑and‑play refiner for various video editors, fixing geometry while updating color parameters; it also introduces Recursive and Ensembled refinement to stabilize outputs across denoising steps and guidance scales. Experiments on DAVIS and LOVEU TGVE demonstrate superior reconstruction quality (PSNR/SSIM) and improved editing consistency (WarpSSIM, $Q_{edit}$) across multiple editors, with substantial gains in efficiency. The work suggests a practical, scalable approach to per-scene temporal coherence and sets the stage for extending 3DGS to 4D video tasks.

Abstract

Recent advancements in zero-shot video diffusion models have shown promise for text-driven video editing, but challenges remain in achieving high temporal consistency. To address this, we introduce Video-3DGS, a 3D Gaussian Splatting (3DGS)-based video refiner designed to enhance temporal consistency in zero-shot video editors. Our approach utilizes a two-stage 3D Gaussian optimization process tailored for editing dynamic monocular videos. In the first stage, Video-3DGS employs an improved version of COLMAP, referred to as MC-COLMAP, which processes original videos using a Masked and Clipped approach. For each video clip, MC-COLMAP generates the point clouds for dynamic foreground objects and complex backgrounds. These point clouds are used to initialize two sets of 3D Gaussians (Frg-3DGS and Bkg-3DGS) that represent foreground and background views. The foreground and background views are then merged with a 2D learnable parameter map to reconstruct full views. In the second stage, we leverage the reconstruction ability developed in the first stage to impose temporal constraints on the video diffusion model. To demonstrate the efficacy of Video-3DGS in both stages, we conduct extensive experiments across two related tasks: Video Reconstruction and Video Editing. Video-3DGS trained with 3k iterations significantly improves video reconstruction quality (+3 PSNR and +7 PSNR) and training efficiency (1.9× and 4.5× faster) over NeRF-based and 3DGS-based state-of-the-art methods, respectively, on the DAVIS dataset. Moreover, it enhances video editing by ensuring temporal consistency across 58 dynamic monocular videos.
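
The merging step described in the abstract can be illustrated with a minimal, dependency-free sketch. It assumes the 2D learnable parameter map is stored as per-pixel logits and squashed to [0, 1] with a sigmoid, that the resulting weight multiplies the foreground render, and that the reconstruction loss is a plain L1 term; the abstract confirms none of these choices, and the names `merge_views` and `l1_loss` are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def merge_views(fg, bg, alpha_logits):
    """Blend two rendered views pixel by pixel.

    fg, bg, alpha_logits are 2D lists (rows of floats) of equal shape.
    ASSUMPTION: weight = sigmoid(logit), out = weight * fg + (1 - weight) * bg.
    """
    merged = []
    for fg_row, bg_row, logit_row in zip(fg, bg, alpha_logits):
        row = []
        for f, b, logit in zip(fg_row, bg_row, logit_row):
            a = sigmoid(logit)
            row.append(a * f + (1.0 - a) * b)
        merged.append(row)
    return merged


def l1_loss(pred, target):
    """Mean absolute error over all pixels (ASSUMED reconstruction loss)."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += abs(p - t)
            count += 1
    return total / count


# Toy 1x2 grayscale example: a bright foreground over a dark background.
foreground = [[1.0, 1.0]]
background = [[0.0, 0.0]]
logits = [[0.0, 50.0]]  # first pixel mixes evenly, second is almost all foreground
frame = merge_views(foreground, background, logits)
print(frame, l1_loss(frame, [[0.5, 1.0]]))
```

In a real pipeline the logits would be optimized jointly with both sets of Gaussians by backpropagating the reconstruction loss through the blend; the sketch only shows the forward computation.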

Paper Structure (35 sections, 7 equations, 12 figures, 10 tables)

This paper contains 35 sections, 7 equations, 12 figures, 10 tables.

Figures (12)

  • Figure 1: The proposed Video-3DGS expands the capabilities of 3D Gaussian Splatting (3DGS) [kerbl20233d] to dynamic monocular video scenes, enhancing temporal consistency in both video reconstruction and video editing. For instance, it consistently captures and reconstructs dynamic objects such as riders and horses (upper section), while also enriching style smoothness in scenarios like drift-car sequences (bottom left) and ensuring structure consistency (bottom right). Regions of interest are highlighted by white dashed rectangles.
  • Figure 2: The overall pipeline of Video-3DGS. We aim to design a video-level 3D Gaussian Splatting framework to reconstruct the video scenes (1st stage), which enables high temporal consistency in video editing (2nd stage). Specifically, Video-3DGS is empowered by the proposed MC-COLMAP, which effectively obtains 3D points for moving foreground objects. The background 3D points are modeled as spherical-shaped random points surrounding the foreground points. Video-3DGS utilizes two sets of 3D Gaussians (Frg-3DGS and Bkg-3DGS) to represent foreground and background 3D points, respectively. A 2D learnable parameter map merges the foreground and background views rendered from each set of 3D Gaussians. The merged views enable high-fidelity video reconstruction. We then apply this reconstruction capability to a zero-shot video editor to enhance temporal consistency while maintaining high fidelity to the text prompt.
  • Figure 3: We revisit the key hyperparameters in the video diffusion process: the denoising step ($N_d$) and the guidance scale. Employing a higher denoising step combined with a lower guidance scale (e.g., similar to the image guidance scale in Text2Video-Zero [text2video-zero]) results in greater fidelity to the editing prompt but compromises structural and temporal consistency, and vice versa. This analysis confirms that the zero-shot video editor model is highly sensitive to these hyperparameters.
  • Figure 4: The proposed Video-3DGS (1st stage) comprises two key components. First, to effectively capture 3D point clouds and corresponding frame viewpoints from dynamic monocular videos, we introduce Masked and Clipped COLMAP (MC-COLMAP). This module spatially and temporally decomposes video frames, facilitating the extraction of clip-level foreground points through progressively processing clips. Additionally, we initialize spherical-shaped random background points conditioned on the foreground points. Second, with these two sets of point clouds, we introduce two distinct sets of 3D Gaussian Splatting (3DGS): Frg-3DGS and Bkg-3DGS, optimized separately for foreground and background points, respectively. Subsequently, we employ a straightforward merging operation to combine the rendered outputs of Frg-3DGS and Bkg-3DGS. We optimize the merged rendered outputs for each clip using the reconstruction loss.
  • Figure 5: Overview of Video-3DGS (2nd stage) as a plug-and-play refiner for video editing. The refiner first fine-tunes the spherical harmonic coefficients of the optimized Video-3DGS on an initially edited video, which is produced by an off-the-shelf video editor with its default hyperparameters: a denoising step $N_d$ and a guidance scale. This method, referred to as the single-phase refiner with Video-3DGS, is further enhanced by our findings in Figure 3. We split $N_d$ into a recursive number $N_r$ and fine-tune the spherical harmonic coefficients against multiple outputs from varied guidance scales, aiming for improved temporal consistency and high fidelity to the editing text. This advanced approach is named the Recursive and Ensembled (RE) refinement.
  • ...and 7 more figures
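
The captions for Figures 2 and 4 mention spherical-shaped random background points conditioned on the foreground points. One possible reading is sketched below under stated assumptions: the points are placed uniformly on the surface of a sphere centred at the foreground centroid, with a radius equal to a multiple of the largest foreground distance from that centroid. The function name `init_background_points`, the `radius_scale` factor, and the choice of surface rather than volume sampling are all hypothetical and are not taken from the paper.

```python
import math
import random


def init_background_points(fg_points, num_points=1000, radius_scale=2.0, seed=0):
    """Sample random background points on a sphere around the foreground.

    fg_points: list of (x, y, z) tuples, e.g. foreground points from SfM.
    ASSUMPTION: sphere centre = foreground centroid, and
    radius = radius_scale * max distance from centroid to a foreground point.
    Returns (center, radius, points).
    """
    rng = random.Random(seed)
    n = len(fg_points)
    center = tuple(sum(p[i] for p in fg_points) / n for i in range(3))
    max_dist = max(math.dist(p, center) for p in fg_points)
    radius = radius_scale * max_dist if max_dist > 0 else 1.0

    points = []
    for _ in range(num_points):
        # A normalized 3D Gaussian sample is a uniform direction on the unit sphere.
        while True:
            d = (rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
            norm = math.sqrt(sum(c * c for c in d))
            if norm > 1e-12:
                break
        points.append(tuple(c + radius * x / norm for c, x in zip(center, d)))
    return center, radius, points


# Toy foreground: two points on the x-axis, so the centroid is the origin.
center, radius, bkg = init_background_points([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)], num_points=5)
print(center, radius, len(bkg))
```

Points generated this way would serve only as an initialization for the background Gaussians; their positions and appearance would still be optimized against the reconstruction loss.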