Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency

Hyunho Ha, Lei Xiao, Christian Richardt, Thu Nguyen-Phuoc, Changil Kim, Min H. Kim, Douglas Lanman, Numair Khan

TL;DR

The paper tackles online novel-view synthesis for multi-view video with stringent view and temporal coherence. It introduces a geometry-guided pipeline that fuses temporally filtered depth into an image-space TSDF and uses that global geometry to guide a blending network that fuses forward-rendered input views. Key contributions include forward rendering with 3D Gaussian splats, an image-based TSDF depth fusion strategy with temporal filtering, and a geometry-guided four-layer U-Net for robust, consistent blending. The approach achieves state-of-the-art view- and time-consistent video synthesis while remaining efficient for online use, showing strong results across multiple challenging datasets and ablations. This framework offers practical benefits for online 3D video applications in education, conferencing, and entertainment by delivering high-quality, stable novel views with reduced computational burden.

Abstract

We introduce a novel geometry-guided online video view synthesis method with enhanced view and temporal consistency. Traditional approaches achieve high-quality synthesis from dense multi-view camera setups but require significant computational resources. In contrast, selective-input methods reduce this cost but often compromise quality, leading to multi-view and temporal inconsistencies such as flickering artifacts. Our method addresses this challenge to deliver efficient, high-quality novel-view synthesis with view and temporal consistency. The key innovation of our approach lies in using global geometry to guide an image-based rendering pipeline. To accomplish this, we progressively refine depth maps using color difference masks across time. These depth maps are then accumulated through truncated signed distance fields in the synthesized view's image space. This depth representation is view and temporally consistent, and is used to guide a pre-trained blending network that fuses multiple forward-rendered input-view images. Thus, the network is encouraged to output geometrically consistent synthesis results across multiple views and time. Our approach achieves consistent, high-quality video synthesis, while running efficiently in an online manner.
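The two depth steps named in the abstract, temporal refinement with color difference masks and accumulation through a truncated signed distance field, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes grayscale colors in [0, 1], a fixed blend factor `alpha`, a hand-picked threshold `thresh`, and depth observations that have already been reprojected onto one target-view ray; the function names and all parameter values are hypothetical.

```python
def temporal_filter(depth_now, depth_prev, color_now, color_prev,
                    thresh=0.1, alpha=0.5):
    """Blend depth over time where color is stable; keep new depth elsewhere.

    All inputs are flat lists with one entry per pixel. A pixel counts as
    dynamic when its color changed by more than `thresh` (assumed rule).
    """
    filtered = []
    for d_n, d_p, c_n, c_p in zip(depth_now, depth_prev, color_now, color_prev):
        dynamic = abs(c_n - c_p) > thresh  # color difference mask
        filtered.append(d_n if dynamic else alpha * d_n + (1 - alpha) * d_p)
    return filtered


def fuse_depths_tsdf(observations, z_samples, trunc):
    """Fuse depth observations for one target pixel with a 1D TSDF on its ray.

    Each observation d contributes clamp((d - z) / trunc, -1, 1) at every
    sample depth z. The fused depth is the first positive-to-negative zero
    crossing of the averaged TSDF, found by linear interpolation.
    """
    tsdf = []
    for z in z_samples:
        vals = [max(-1.0, min(1.0, (d - z) / trunc)) for d in observations]
        tsdf.append(sum(vals) / len(vals))
    for k in range(len(z_samples) - 1):
        a, b = tsdf[k], tsdf[k + 1]
        if a > 0 >= b:
            t = a / (a - b)
            return z_samples[k] + t * (z_samples[k + 1] - z_samples[k])
    return None  # no surface found along this ray


# Usage: the first pixel is static (blended), the second is dynamic (replaced).
filtered = temporal_filter([2.0, 3.0], [1.0, 1.0], [0.5, 0.9], [0.5, 0.1])

# Usage: three slightly disagreeing observations fuse to their consensus depth.
z_samples = [1.0 + 0.1 * k for k in range(21)]
fused = fuse_depths_tsdf([1.9, 2.1, 2.0], z_samples, trunc=0.5)
```

Averaging signed distances rather than raw depths lets observations that disagree slightly settle on a single surface crossing, which is the usual motivation for TSDF fusion.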

Paper Structure

This paper contains 14 sections, 7 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Our method enables efficient rendering of high-quality, consistent 3D videos by addressing the challenges of view synthesis in both spatial and temporal dimensions. In multi-camera setups, novel-view synthesis must manage continuous changes across view and time to ensure smooth visualization. Existing methods process frames independently and often introduce flickering artifacts. We propose using aggregated depth as geometric guidance to a blending network to maintain consistent color and detail across frames. This results in view and temporally consistent 3D videos, achieving state-of-the-art quality while providing improved processing speed over prior methods.
  • Figure 2: Pipeline overview. Given multi-view RGB-D videos, our method forward-renders a subset of views into the target camera using 3D Gaussian splatting (\ref{subsec:pixelgs}). The per-view depth maps are fused using a truncated signed distance field (TSDF) that is regularized to be view and temporally consistent (\ref{subsec:tsdf}). This geometric guidance enables a CNN to blend the forward-rendered images and inpaint disoccluded regions to produce consistent novel-view results (\ref{subsec:blending}).
  • Figure 3: View-temporal consistency in TSDF depth. The global TSDF provides view consistency. Temporal consistency is encouraged by filtering the input depth at consecutive frames using color difference masks for dynamic regions. Additionally, these masks are used to integrate the rendered depth from the previous frame into the TSDF, ensuring consistency across view and time.
  • Figure 4: (a) Rendering all input-view Gaussians into the target camera in a single pass \cite{zheng2024gps} creates flying pixels and disocclusion holes. (b) On the other hand, blending multiple forward-rendered images using distance-based weights \cite{mildenhall2019local} leads to ghosting artifacts. (c) A naive blending network is unable to fix the ghosting. (d) Our method uses the target view's geometry -- rendered as a depth map from a TSDF -- to guide the blending network. This allows it to correctly fuse unstable forward-rendered input views.
  • Figure 5: Qualitative comparison of view consistency. We compare view consistency using epipolar plane images (EPI). We generate the EPI for each method by rendering novel views along a horizontally translating camera path (the $v$ dimension). For a fixed image row $y$, an EPI then represents a slice of the scene in the space-view dimensions $x$–$v$, allowing a subset of points to be visualized across views as sloping lines. All baseline methods show sudden changes along EPI lines, indicating view inconsistency. Our method generates smooth and continuous EPI lines, showing that it maintains view consistency as the camera translates.
  • ...and 4 more figures
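
The EPI construction described in the Figure 5 caption can be sketched as follows. This is an illustrative toy, not the paper's evaluation code: `toy_view` is a hypothetical stand-in for a renderer, and the scene is assumed to be a single bright stripe that shifts by one pixel per camera step.

```python
def build_epi(views, row):
    """Stack the same image row from every view, giving an x-v slice.

    `views[v][y][x]` holds the rendered image for camera position v.
    The result is indexed as epi[v][x].
    """
    return [list(view[row]) for view in views]


def toy_view(width, height, stripe_x):
    """Hypothetical renderer: a black image with one bright vertical stripe."""
    return [[1.0 if x == stripe_x else 0.0 for x in range(width)]
            for _ in range(height)]


# Five views along a horizontal camera path; the stripe moves 1 pixel per view.
views = [toy_view(width=8, height=4, stripe_x=1 + v) for v in range(5)]
epi = build_epi(views, row=2)

# In a view-consistent result the stripe traces a straight sloping line.
stripe_positions = [epi_row.index(1.0) for epi_row in epi]
```

A constant step between consecutive entries of `stripe_positions` corresponds to a straight EPI line; jumps in that step are the kind of sudden change the caption attributes to view inconsistency.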