AR4D: Autoregressive 4D Generation from Monocular Videos

Hanxin Zhu, Tianyu He, Xiqian Yu, Junliang Guo, Zhibo Chen, Jiang Bian

TL;DR

AR4D tackles the limitations of SDS-based 4D generation from monocular videos by introducing an SDS-free, three-stage pipeline. It initializes a canonical 3D Gaussian space from pre-trained 3D generators, then autoregressively generates subsequent frames with per-frame local deformations aided by progressive view sampling, and finally refines the results with a global deformation field to curb appearance drift. The approach achieves state-of-the-art results on video-to-4D and text-to-4D tasks, offering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts. By leveraging 3D Gaussian Splatting and large-scale 3D priors, AR4D provides a practical and scalable pathway to high-quality dynamic 4D content from monocular footage.

Abstract

Recent advancements in generative models have ignited substantial interest in dynamic 3D content creation (i.e., 4D generation). Existing approaches primarily rely on Score Distillation Sampling (SDS) to infer novel-view videos, which typically leads to limited diversity, spatial-temporal inconsistency, and poor prompt alignment due to the inherent randomness of SDS. To tackle these problems, we propose AR4D, a novel paradigm for SDS-free 4D generation. Our paradigm consists of three stages. First, for a monocular video that is either generated or captured, we use pre-trained expert models to create a 3D representation of the first frame, which is then fine-tuned to serve as the canonical space. Second, motivated by the fact that videos unfold naturally in an autoregressive manner, we generate each frame's 3D representation from the previous frame's representation, since this autoregressive approach facilitates more accurate geometry and motion estimation. To prevent overfitting during this process, we introduce a progressive view sampling strategy that uses priors from pre-trained large-scale 3D reconstruction models. Third, to avoid the appearance drift introduced by autoregressive generation, we add a refinement stage based on a global deformation field and the geometry of each frame's 3D representation. Extensive experiments demonstrate that AR4D achieves state-of-the-art 4D generation without SDS, delivering greater diversity, improved spatial-temporal consistency, and better alignment with input prompts.
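
The control flow of the three stages can be illustrated with a deliberately simplified toy sketch. It is not the authors' implementation: the "3D representation" here is just an array of point positions standing in for Gaussian centers, the supervision is a set of known target points rather than rendered views, and the function names (`fit_local_deformation`, `autoregressive_4d`, `global_deformation`) are hypothetical. The only thing it is meant to show is that each frame is fitted starting from the previous frame's result, and that every frame can afterwards be re-expressed as one deformation of the canonical space.

```python
import numpy as np


def fit_local_deformation(prev_points, target_points, steps=200, lr=0.1):
    """Toy stand-in for the Generation stage (assumption, not the paper's loss).

    Runs gradient descent on per-point offsets so that
    prev_points + offsets matches target_points in squared error.
    """
    offsets = np.zeros_like(prev_points)
    for _ in range(steps):
        residual = (prev_points + offsets) - target_points
        offsets -= lr * 2.0 * residual  # gradient of the squared error
    return offsets


def autoregressive_4d(canonical_points, targets):
    """Build each frame from the previous frame's representation."""
    frames = [canonical_points]  # frame 0 plays the role of the canonical space
    for target in targets:
        offsets = fit_local_deformation(frames[-1], target)
        frames.append(frames[-1] + offsets)
    return frames


def global_deformation(frames):
    """Toy stand-in for the Refinement stage (assumption).

    Expresses every frame as a single displacement of the canonical points,
    instead of a chain of per-frame local deformations.
    """
    canonical = frames[0]
    return [frame - canonical for frame in frames]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    canonical = rng.normal(size=(5, 3))
    # Hypothetical motion: every point drifts by 0.1 per frame on each axis.
    targets = [canonical + 0.1 * t for t in range(1, 4)]
    frames = autoregressive_4d(canonical, targets)
    field = global_deformation(frames)
    print(len(frames), float(np.abs(field[-1]).max()))
```

In the actual method, the per-frame fit would be driven by image-space supervision and progressive view sampling, and the global field would be a learned deformation model; neither is specified in enough detail here to reproduce, so both are replaced by the simplest possible placeholders.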

Paper Structure (22 sections, 11 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Illustration of autoregressive 4D generation. In comparison to SDS-based methods (e.g., Consistent4D [jiang2023consistent4d]), our approach enables SDS-free 4D generation with substantial improvements, including better alignment with input videos and improved spatial-temporal consistency.
  • Figure 2: Paradigm of our proposed AR4D. To enable SDS-free 4D generation, we propose a three-stage approach consisting of Initialization, Generation, and Refinement. Please see the methods section for more details.
  • Figure 3: Ablation studies on finetuning the 3D Gaussians in the Initialization stage reveal that finetuning can capture finer texture details in the reference frame, enhancing the quality of subsequent generation.
  • Figure 4: Ablation studies on applying autoregressive 4D generation and the progressive view sampling strategy in the Generation stage. Using both yields the best performance.
  • Figure 5: Results of the Refinement stage demonstrate its effectiveness in addressing appearance drift. While appearance may fluctuate, the geometry (evident in the consistent depth map) remains stable, enabling the generation of spatially and temporally consistent 4D content.
  • ...and 4 more figures