Uni4D: Unifying Visual Foundation Models for 4D Modeling from a Single Video

David Yifan Yao, Albert J. Zhai, Shenlong Wang

TL;DR

Uni4D tackles the challenge of holistic 4D modeling from casual monocular video by unifying pretrained visual foundation models within a training-free, energy-minimization framework. It introduces a three-stage optimization that sequentially estimates camera pose, static geometry, and dynamic 3D motion, guided by cues from models for segmentation, depth, and dense tracking. The approach achieves state-of-the-art performance on pose and video depth benchmarks across synthetic and real-world datasets, while avoiding retraining and leveraging strong priors for robustness. The results demonstrate coherent, high-quality 4D reconstructions with improved temporal and spatial consistency, highlighting the value of modular foundation-model cues in dynamic scene understanding. This work paves the way for scalable, generalizable 4D modeling in unconstrained video data using existing pretrained components.
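To make the staged ordering concrete, here is a minimal toy sketch in a 1D world. It is not the authors' implementation: the observation model (each track reports a point's position relative to the camera), the function names, and the acceleration-based smoothness prior are all assumptions chosen for illustration. It only shows how a camera estimate from static tracks can be fixed first, then reused to recover static geometry, and finally to recover a dynamic trajectory under a motion prior.

```python
import numpy as np

# Toy 1D analogue of a staged optimization (illustrative assumption, not Uni4D itself).
# Observation model: a track reports (point position - camera position) per frame.

def stage1_camera(obs_static):
    """Estimate camera positions from static tracks, fixing the gauge c_0 = 0.

    obs_static has shape (T, N): frame t, static point i.
    For a static point, obs[0, i] - obs[t, i] = c_t - c_0, so the
    least-squares estimate is the mean over points.
    """
    return (obs_static[0] - obs_static).mean(axis=1)

def stage2_static(obs_static, cams):
    """Estimate static point positions with the cameras held fixed."""
    return (obs_static + cams[:, None]).mean(axis=0)

def stage3_dynamic(obs_dyn, cams, lam=1.0):
    """Estimate a dynamic point's trajectory with cameras held fixed.

    Minimizes sum_t (D_t - y_t)^2 + lam * sum_t (D_{t+2} - 2 D_{t+1} + D_t)^2,
    where y_t = obs_dyn[t] + cams[t]. The second term is a hypothetical
    smoothness prior that penalizes acceleration.
    """
    y = obs_dyn + cams
    T = len(y)
    L = np.diff(np.eye(T), n=2, axis=0)  # (T-2, T) second-difference operator
    return np.linalg.solve(np.eye(T) + lam * L.T @ L, y)

# Synthetic, noise-free data: 4 frames, 3 static points, 1 dynamic point.
true_cams = np.array([0.0, 0.5, 1.0, 1.5])
true_static = np.array([2.0, 3.0, 5.0])
true_dyn = 1.0 + 0.25 * np.arange(4)  # constant-velocity motion

obs_static = true_static[None, :] - true_cams[:, None]
obs_dyn = true_dyn - true_cams

cams = stage1_camera(obs_static)
static_pts = stage2_static(obs_static, cams)
dyn_traj = stage3_dynamic(obs_dyn, cams, lam=10.0)
```

In this noise-free setup each stage recovers its ground truth, because the constant-velocity trajectory incurs no acceleration penalty. With noisy observations the prior weight `lam` would trade data fidelity against smoothness.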

Abstract

This paper presents a unified approach to understanding dynamic scenes from casual videos. Large pretrained vision foundation models, such as vision-language, video depth prediction, motion tracking, and segmentation models, offer promising capabilities. However, training a single model for comprehensive 4D understanding remains challenging. We introduce Uni4D, a multi-stage optimization framework that harnesses multiple pretrained models to advance dynamic 3D modeling, including static/dynamic reconstruction, camera pose estimation, and dense 3D motion tracking. Our results show state-of-the-art performance in dynamic 4D modeling with superior visual quality. Notably, Uni4D requires no retraining or fine-tuning, highlighting the effectiveness of repurposing visual foundation models for 4D understanding.

Paper Structure

This paper contains 42 sections, 11 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: Given a casually captured video, Uni4D harnesses pretrained visual foundation models and multi-stage optimization to jointly estimate camera poses, dynamic geometry, and dense 3D motion. The resulting camera poses and geometry are accurate, consistent, and coherent both temporally and spatially. This is all done without any additional training or fine-tuning.
  • Figure 2: Uni4D outperforms other recent 4D modeling methods in both camera pose and geometry accuracy on the Sintel dataset.
  • Figure 3: Given a casually captured video, Uni4D exploits visual foundation models to extract dynamic segmentation, video depth, and motion tracks. Static geometry and poses are obtained through tracklet-based structure-from-motion along with camera motion priors. Dynamic geometry is improved through nonrigid bundle adjustment and scene motion priors. A final fusion densifies geometry to obtain high quality 4D reconstruction.
  • Figure 4: Qualitative results of 4D reconstruction on the DAVIS [perazzi2016benchmark] dataset. CasualSAM [zhang2022structure] suffers from slanted geometry, and MonST3R [zhang2024monst3r] has unclear geometry and does not resolve conflicts from multiple views (note the wall in the bird's-eye view). Both CasualSAM and MonST3R lack clean dynamic reconstruction and segmentation. Uni4D achieves a realistic layout, thanks to joint optimization, and provides accurate dynamic segmentation and reconstruction by leveraging visual foundation models as cues.
  • Figure 5: Qualitative results on the Bonn dataset. Both CasualSAM and MonST3R show trailing artifacts and incorrect dynamic estimates. Uni4D provides clear dynamic and static geometry.
  • ...and 11 more figures