Orthogonal Spatial-temporal Distributional Transfer for 4D Generation

Wei Liu, Shengqiong Wu, Bobo Li, Haoyu Zhao, Hao Fei, Mong-Li Lee, Wynne Hsu

TL;DR

This work develops a spatial-temporal-disentangled 4D (STD-4D) Diffusion model that synthesizes 4D-aware videos through disentangled spatial and temporal latents, and designs a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, improving 4D deformation and 4D Gaussian feature modeling.

Abstract

In the AIGC era, generating high-quality 4D content has garnered increasing research attention. Unfortunately, current 4D synthesis research is severely constrained by the lack of large-scale 4D datasets, preventing models from adequately learning the critical spatial-temporal features necessary for high-quality 4D generation, thus hindering progress in this domain. To combat this, we propose a novel framework that transfers rich spatial priors from existing 3D diffusion models and temporal priors from video diffusion models to enhance 4D synthesis. We develop a spatial-temporal-disentangled 4D (STD-4D) Diffusion model, which synthesizes 4D-aware videos through disentangled spatial and temporal latents. To facilitate the best feature transfer, we design a novel Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism, where the spatiotemporal feature distributions are carefully modeled and injected into the STD-4D Diffusion. Furthermore, during the 4D construction, we devise a spatial-temporal-aware HexPlane (ST-HexPlane) to integrate the transferred spatiotemporal features, thereby improving 4D deformation and 4D Gaussian feature modeling. Experiments demonstrate that our method significantly outperforms existing approaches, achieving superior spatial-temporal consistency and higher-quality 4D synthesis.
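
For readers unfamiliar with the HexPlane representation that ST-HexPlane builds on, the following is a minimal sketch of a generic HexPlane feature lookup, not the paper's ST-HexPlane. It assumes six learnable 2D feature planes (XY, XZ, YZ, XT, YT, ZT), coordinates normalized to [0, 1], bilinear sampling, and a multiply-then-concatenate fusion of complementary plane pairs. The function names, resolutions, and random initial features are all hypothetical placeholders.

```python
import numpy as np


def make_hexplane(res_xyz, res_t, channels, seed=0):
    """Create six feature planes with random placeholder values.

    In a real model these grids would be learnable parameters.
    """
    rng = np.random.default_rng(seed)
    shapes = {
        "xy": (res_xyz, res_xyz),
        "xz": (res_xyz, res_xyz),
        "yz": (res_xyz, res_xyz),
        "xt": (res_xyz, res_t),
        "yt": (res_xyz, res_t),
        "zt": (res_xyz, res_t),
    }
    return {
        name: rng.standard_normal((h, w, channels))
        for name, (h, w) in shapes.items()
    }


def bilinear_sample(plane, u, v):
    """Bilinearly sample a (H, W, C) plane at (u, v), both in [0, 1]."""
    h, w, _ = plane.shape
    a, b = u * (h - 1), v * (w - 1)
    a0, b0 = int(np.floor(a)), int(np.floor(b))
    a1, b1 = min(a0 + 1, h - 1), min(b0 + 1, w - 1)
    fa, fb = a - a0, b - b0
    low = (1 - fb) * plane[a0, b0] + fb * plane[a0, b1]
    high = (1 - fb) * plane[a1, b0] + fb * plane[a1, b1]
    return (1 - fa) * low + fa * high


def query_hexplane(planes, x, y, z, t):
    """Return the fused feature vector for one 4D point (x, y, z, t).

    Each purely spatial plane is paired with the spatiotemporal plane
    covering the two remaining axes; paired features are multiplied
    elementwise and the three products are concatenated. This fusion
    rule is one common choice and is an assumption here.
    """
    coords = {"x": x, "y": y, "z": z, "t": t}
    pairs = [("xy", "zt"), ("xz", "yt"), ("yz", "xt")]
    fused = []
    for spatial, temporal in pairs:
        fs = bilinear_sample(
            planes[spatial], coords[spatial[0]], coords[spatial[1]]
        )
        ft = bilinear_sample(
            planes[temporal], coords[temporal[0]], coords[temporal[1]]
        )
        fused.append(fs * ft)
    return np.concatenate(fused)


planes = make_hexplane(res_xyz=8, res_t=4, channels=3)
feature = query_hexplane(planes, x=0.25, y=0.5, z=0.75, t=0.1)
```

In 4D Gaussian pipelines, a small decoder typically maps a fused feature like this to per-Gaussian deformation at time t; how ST-HexPlane incorporates the transferred spatiotemporal features is described in the paper itself and is not reproduced here.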

Paper Structure

This paper contains 23 sections, 15 equations, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: 4D generation requires spatial and temporal coherence simultaneously. Because 4D data resources are scarce, this paper proposes transferring the 3D spatial prior and the temporal prior from existing resource-rich 3D diffusion and video diffusion models, respectively.
  • Figure 2: Overall pipeline of our 4D generation framework, including (a) 4D diffusion stage, and (b) 4D construction stage.
  • Figure 3: Overview of the knowledge transfer process between external 3D/video diffusion models and our STD-4D Diffusion. STD-4D Diffusion disentangles the 4D latent into spatial and temporal channels; the spatial features from 3D diffusion and the temporal features from video diffusion are distilled into the spatial and temporal blocks of the 4D-UNet, respectively, via the Orthogonal Spatial-temporal Distributional Transfer (Orster) mechanism (cf. Figure 5).
  • Figure 4: Four-stage STD-4D Diffusion training.
  • Figure 5: A closer look at the Orthogonal Spatial-temporal Distributional Transfer (Orster) learning module.
  • ...and 1 more figure