EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh

Tao Hu, Haoyang Peng, Xiao Liu, Yuewen Ma

TL;DR

EX-4D addresses the challenge of generating camera-controllable 4D videos from monocular input at extreme viewpoints by introducing a Depth Watertight Mesh (DW-Mesh) that models both visible and occluded geometry. It pairs the DW-Mesh with two further components: a simulated masking strategy that creates training data without multi-view datasets, and a lightweight LoRA-based video diffusion adapter that ensures temporal coherence and physical consistency. Empirical results show state-of-the-art performance across quantitative metrics and user studies, especially as viewpoint angles become more extreme. This approach enables practical 4D video synthesis from monocular videos with reduced data requirements and efficient training, expanding possibilities for free-viewpoint video and immersive applications.

Abstract

Generating high-quality camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoints. Existing methods often struggle with geometric inconsistencies and occlusion artifacts at boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. The representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency under extreme camera poses. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data from monocular videos alone. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in terms of physical consistency and extreme-view quality, enabling practical 4D video generation.
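
To make the "lightweight LoRA-based adapter" mentioned in the abstract more concrete, the following is a minimal sketch of the generic LoRA idea: a frozen weight matrix W is augmented with a trainable low-rank update scaled by alpha / rank. It is not the EX-4D adapter itself; the class name `LoRALinear`, the helper `matvec`, the initialization scale, and all sizes are assumptions chosen for illustration.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical illustration of generic LoRA, not the paper's adapter.
    """

    def __init__(self, W, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W), len(W[0])
        self.W = W                      # frozen base weight
        self.scale = alpha / rank       # standard LoRA scaling factor
        # A gets a small random init; B starts at zero, so the adapted
        # layer initially behaves exactly like the frozen base layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W, x)
        delta = matvec(self.B, matvec(self.A, x))   # low-rank path: B (A x)
        return [b + self.scale * d for b, d in zip(base, delta)]

    def trainable_params(self):
        """Number of values in A and B (W is frozen and not counted)."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])
```

For an 8 x 8 base weight and rank 2, this sketch trains 32 values instead of the 64 in W; the saving grows quickly for the much larger matrices found in video diffusion models, which is the usual motivation for calling such an adapter lightweight.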

Paper Structure

This paper contains 37 sections, 5 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Our EX-4D framework takes a monocular video as input and generates high-quality 4D videos under extreme viewpoints. By leveraging the proposed Depth Watertight Mesh representation, it effectively handles occlusions at boundaries and ensures geometric consistency, enabling visually coherent and realistic results.
  • Figure 2: Illustration of DW-Mesh construction. (a) Ground Truth: The original scene with complete geometry. (b) Visible Mesh (2D View): The 3D-reconstructed visible mesh, showing only the visible regions. (c) Visible Mesh (3D View): 3D visualization of the visible mesh, highlighting the missing occluded regions and physically incorrect boundaries. (d) DW-Mesh Construction: Our proposed Depth Watertight Mesh explicitly models both visible and occluded regions, ensuring geometric consistency. (e) DW-Mesh (2D View): 2D representation of the DW-Mesh, showing the inclusion of occluded areas. (f) DW-Mesh (3D View): 3D visualization of the DW-Mesh, demonstrating its watertight structure and its ability to handle occlusions at boundaries to maintain physical consistency.
  • Figure 3: Illustration of our mask generation methods. Top Row: Input Monocular Video; Middle Row: Rendering Mask Generation uses DW-Mesh to simulate occlusions that would occur in novel viewpoints; Bottom Row: Tracking Mask Generation preserves temporal consistency by tracking points across frames and marking consistent occlusion patterns.
  • Figure 4: Overview of the EX-4D framework. Our approach transforms monocular videos into extreme-viewpoint 4D videos through three key components: (1) Depth Watertight Mesh construction, which explicitly models both visible and occluded regions; (2) color and mask videos, which are simulated for training and rendered for inference; and (3) a lightweight LoRA-based video diffusion adapter that ensures geometric consistency and temporal coherence in the synthesized 4D videos.
  • Figure 5: Qualitative comparison of our EX-4D against state-of-the-art approaches under extreme viewpoints. Our approach produces physically consistent videos with effective occlusion handling and temporal coherence. In contrast, baseline methods exhibit artifacts such as Physical Inconsistency and Wrong Occlusion due to their limited ability to model hidden geometry, or suffer from Severe Viewpoint Deviation in novel scenes outside their training data distribution.
  • ...and 11 more figures
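
The occlusion masks described in the Figure 3 caption mark regions that become hidden or exposed when the camera moves. As a rough intuition only, the sketch below forward-warps a single scanline of depth values to a horizontally shifted camera and flags target pixels that no source pixel reaches. This is a simplified point-based warp, not DW-Mesh rendering as used in the paper; the function name `disocclusion_mask` and the parameter `shift` (standing in for focal length times baseline) are assumptions made for illustration.

```python
def disocclusion_mask(depth_row, shift):
    """Forward-warp one scanline to a horizontally shifted camera.

    depth_row: positive depth per pixel (smaller means nearer).
    shift: hypothetical stand-in for focal_length * baseline, so that
           each pixel moves by round(shift / depth) pixels.
    Returns a list of booleans, True where no source pixel lands (a hole).
    """
    width = len(depth_row)
    zbuf = [None] * width                # nearest depth seen per target pixel
    for x, z in enumerate(depth_row):
        target = x + round(shift / z)    # nearer pixels move farther
        inside = 0 <= target < width
        # Z-buffer test: keep the nearest surface that lands on this pixel.
        if inside and (zbuf[target] is None or z < zbuf[target]):
            zbuf[target] = z
    return [z is None for z in zbuf]


# A near object (depth 1) in front of a far background (depth 4).
mask = disocclusion_mask([4, 4, 1, 1, 4, 4, 4, 4], shift=4)
print(mask)
```

In this toy example the near object moves farther than the background, leaving a gap behind it that the input view never observed, plus an empty pixel at the image border. Holes of this kind are what a generative model has to fill in at novel viewpoints.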