Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video
David Yifan Yao, Albert J. Zhai, Shenlong Wang
TL;DR
Uni4D performs holistic 4D modeling from casual monocular video by combining pretrained visual foundation models in a training-free, energy-minimization framework. A three-stage optimization estimates camera pose, then static geometry, then dynamic 3D motion, guided by cues from segmentation, depth, and dense tracking models. The approach achieves state-of-the-art performance on camera pose and video depth benchmarks across synthetic and real-world datasets, with no retraining, relying on the pretrained models' priors for robustness. The resulting 4D reconstructions are coherent and show improved temporal and spatial consistency, which highlights the value of modular foundation-model cues for dynamic scene understanding and for generalizable 4D modeling of unconstrained video with existing pretrained components.
Abstract
This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.
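The staged optimization can be illustrated with a toy sketch. This is not the authors' formulation: it assumes a 2D, translation-only camera in which a track observation is simply the point minus the camera offset, and all function names (stage1_init_cameras, stage2_refine_static, stage3_dynamic_track) and energy terms are hypothetical. It only shows the general pattern of estimating cameras first, refining static geometry second, and solving for dynamic motion last with the cameras frozen.

```python
import numpy as np

def reprojection_energy(points, cams, obs):
    # Toy camera model (assumption): a 2D track is the point minus the camera offset.
    pred = points[None, :, :] - cams[:, None, :]
    return float(np.sum((pred - obs) ** 2))

def stage1_init_cameras(points_prior, obs):
    # Points stay at their prior positions; solve for per-frame camera offsets.
    cams = np.mean(points_prior[None, :, :] - obs, axis=1)
    # Fix the gauge: shift everything so the first camera sits at the origin.
    return cams - cams[0], points_prior - cams[0]

def stage2_refine_static(points, cams, obs, iters=10):
    # Block coordinate descent: each update is the exact minimizer over its block.
    for _ in range(iters):
        points = np.mean(obs + cams[:, None, :], axis=0)
        cams = np.mean(points[None, :, :] - obs, axis=1)
        points, cams = points - cams[0], cams - cams[0]
    return points, cams

def stage3_dynamic_track(cams, dyn_obs, lam=1.0):
    # Cameras frozen; minimize data term + lam * temporal smoothness.
    # Setting the gradient to zero gives the linear system (I + lam * D^T D) X = target.
    T = len(cams)
    diff = np.diff(np.eye(T), axis=0)  # (T-1, T) finite-difference operator
    system = np.eye(T) + lam * diff.T @ diff
    return np.linalg.solve(system, dyn_obs + cams)

# Synthetic data standing in for foundation-model cues (all values are made up).
rng = np.random.default_rng(0)
T, N = 10, 30
true_cams = np.cumsum(rng.normal(0, 0.1, (T, 2)), axis=0)
true_cams -= true_cams[0]
true_pts = rng.uniform(-1, 1, (N, 2))
obs = true_pts[None] - true_cams[:, None] + rng.normal(0, 0.01, (T, N, 2))
prior = true_pts + rng.normal(0, 0.2, (N, 2))  # noisy stand-in for a depth prior
true_dyn = np.linspace([0.0, 0.0], [0.9, 0.45], T)  # one moving point
dyn_obs = true_dyn - true_cams + rng.normal(0, 0.01, (T, 2))

cams1, pts1 = stage1_init_cameras(prior, obs)
e1 = reprojection_energy(pts1, cams1, obs)
pts2, cams2 = stage2_refine_static(pts1, cams1, obs)
e2 = reprojection_energy(pts2, cams2, obs)
dyn = stage3_dynamic_track(cams2, dyn_obs)
print(f"energy after stage 1: {e1:.3f}, after stage 2: {e2:.3f}")
```

In this sketch each stage minimizes its own energy while holding the earlier estimates fixed or using them as initialization, so no learning step is involved. A real system would replace the toy camera model with full projective geometry and use actual depth, segmentation, and tracking outputs.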
