EX-4D: EXtreme Viewpoint 4D Video Synthesis via Depth Watertight Mesh
Tao Hu, Haoyang Peng, Xiao Liu, Yuewen Ma
TL;DR
EX-4D addresses the challenge of generating camera-controllable 4D videos from monocular input at extreme viewpoints by introducing Depth Watertight Mesh (DW-Mesh) to model both visible and occluded geometry. It pairs DW-Mesh with a simulated masking strategy to create training data without multi-view datasets and a lightweight LoRA-based video diffusion adapter to ensure temporal coherence and physical consistency. Empirical results show state-of-the-art performance across quantitative metrics and user studies, especially as viewpoint angles become more extreme. This approach enables practical 4D video synthesis from monocular videos with reduced data requirements and efficient training, expanding possibilities for free-viewpoint video and immersive applications.
Abstract
Generating high-quality camera-controllable videos from monocular input is a challenging task, particularly under extreme viewpoints. Existing methods often struggle with geometric inconsistencies and occlusion artifacts at object boundaries, leading to degraded visual quality. In this paper, we introduce EX-4D, a novel framework that addresses these challenges through a Depth Watertight Mesh representation. The representation serves as a robust geometric prior by explicitly modeling both visible and occluded regions, ensuring geometric consistency under extreme camera poses. To overcome the lack of paired multi-view datasets, we propose a simulated masking strategy that generates effective training data from monocular videos alone. Additionally, a lightweight LoRA-based video diffusion adapter is employed to synthesize high-quality, physically consistent, and temporally coherent videos. Extensive experiments demonstrate that EX-4D outperforms state-of-the-art methods in terms of physical consistency and extreme-view quality, enabling practical 4D video generation.
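To make the "lightweight LoRA-based adapter" idea concrete, the following is a minimal sketch of the generic low-rank adaptation (LoRA) update on a single linear layer. It is not the EX-4D implementation: the helper names (matmul, transpose, lora_forward), the layer sizes, the rank r, and the scaling factor alpha are all illustrative assumptions. It only shows why such an adapter is cheap to train: the base weight stays frozen and only two small matrices are learned.

```python
import random

def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w, a, b, alpha):
    """Frozen linear layer plus a low-rank update: y = x W^T + (alpha / r) x A^T B^T.

    x: inputs, shape (n, d_in)
    w: frozen base weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)
    scale = alpha / r
    base = matmul(x, transpose(w))                       # (n, d_out)
    low = matmul(matmul(x, transpose(a)), transpose(b))  # (n, d_out)
    return [[p + scale * q for p, q in zip(br, lr)] for br, lr in zip(base, low)]

# Hypothetical sizes, chosen only for illustration.
d_in, d_out, r = 8, 6, 2
random.seed(0)
w = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
a = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(r)]
b_zero = [[0.0] * r for _ in range(d_out)]  # common LoRA init: B starts at zero
x = [[1.0] * d_in]

base_out = matmul(x, transpose(w))
lora_out = lora_forward(x, w, a, b_zero, alpha=4.0)

full_params = d_in * d_out        # parameters in the frozen base weight
lora_params = r * (d_in + d_out)  # parameters actually trained
```

With B initialized to zero, the adapted layer reproduces the frozen layer exactly, so training starts from the pretrained model's behavior. In this toy setting the adapter trains 28 parameters against 48 in the base weight; the gap widens sharply at the layer widths used in real video diffusion models.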
