Geometry-guided Online 3D Video Synthesis with Multi-View Temporal Consistency
Hyunho Ha, Lei Xiao, Christian Richardt, Thu Nguyen-Phuoc, Changil Kim, Min H. Kim, Douglas Lanman, Numair Khan
TL;DR
The paper tackles online novel-view synthesis for multi-view video under strict view- and temporal-consistency requirements. It introduces a geometry-guided pipeline that accumulates temporally filtered depth maps into an image-space TSDF, then uses the resulting global geometry to guide a blending network that combines forward-rendered input views. Key contributions include forward rendering with 3D Gaussian splats, an image-based TSDF depth fusion strategy with temporal filtering, and a geometry-guided four-layer U-Net for robust, consistent blending. The authors report state-of-the-art view- and time-consistent video synthesis at a cost low enough for online use, supported by results on multiple challenging datasets and by ablation studies. The framework targets online 3D video applications in education, conferencing, and entertainment, where stable, high-quality novel views are needed at a reduced computational cost.
Abstract
We introduce a novel geometry-guided online video view synthesis method with enhanced view and temporal consistency. Traditional approaches achieve high-quality synthesis from dense multi-view camera setups but require significant computational resources. In contrast, selective-input methods reduce this cost but often compromise quality, leading to multi-view and temporal inconsistencies such as flickering artifacts. Our method addresses this challenge to deliver efficient, high-quality novel-view synthesis with view and temporal consistency. The key innovation of our approach lies in using global geometry to guide an image-based rendering pipeline. To accomplish this, we progressively refine depth maps using color difference masks across time. These depth maps are then accumulated through truncated signed distance fields in the synthesized view's image space. This depth representation is view and temporally consistent, and is used to guide a pre-trained blending network that fuses multiple forward-rendered input-view images. Thus, the network is encouraged to output geometrically consistent synthesis results across multiple views and time. Our approach achieves consistent, high-quality video synthesis, while running efficiently in an online manner.
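The two geometric steps named in the abstract, temporal depth refinement with a color difference mask and accumulation through a truncated signed distance field in the target view's image space, can be illustrated per pixel. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the thresholds `color_thresh`, `alpha`, and `trunc`, the uniform weighting, and the assumption that input depths have already been warped into the target view are all hypothetical choices made for illustration.

```python
from typing import Optional, Sequence


def filter_depth_temporally(prev_depth: float, cur_depth: float,
                            prev_rgb: Sequence[float], cur_rgb: Sequence[float],
                            color_thresh: float = 0.05,  # assumed threshold
                            alpha: float = 0.8) -> float:  # assumed history weight
    """Per-pixel temporal filter: reuse depth history only where color is stable."""
    # Mean absolute color difference between consecutive frames (values in [0, 1]).
    color_diff = sum(abs(c - p) for c, p in zip(cur_rgb, prev_rgb)) / len(cur_rgb)
    if color_diff >= color_thresh:
        return cur_depth  # pixel changed: trust the new observation
    return alpha * prev_depth + (1.0 - alpha) * cur_depth


def fuse_depths_tsdf(depths: Sequence[float], z_samples: Sequence[float],
                     trunc: float = 0.2) -> Optional[float]:
    """Fuse depth observations for one target-view pixel along its viewing ray.

    `depths` are per-view depths assumed to be already expressed in the target
    view; `z_samples` are increasing depth samples along the ray.
    """
    tsdf = []
    for z in z_samples:
        total, weight = 0.0, 0
        for d in depths:
            sdf = d - z  # positive in front of the observed surface
            if d > 0 and sdf > -trunc:  # skip invalid depths and far-behind samples
                total += max(-1.0, min(1.0, sdf / trunc))
                weight += 1
        # Samples with no observation are treated as free space (an assumption).
        tsdf.append(total / weight if weight else 1.0)

    # The fused surface is the first positive-to-non-positive zero crossing.
    for i in range(len(tsdf) - 1):
        f0, f1 = tsdf[i], tsdf[i + 1]
        if f0 > 0 and f1 <= 0:
            t = f0 / (f0 - f1)  # linear interpolation between the two samples
            return z_samples[i] + t * (z_samples[i + 1] - z_samples[i])
    return None  # no surface found along this ray
```

In this sketch, two views that observe the same pixel at depths 2.0 and 2.1 fuse to a single depth near 2.05, which is what makes the result independent of any one input view. A practical system would run both steps over whole images on the GPU and pass the fused depth to the blending network as guidance.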
