Animate3D: Animating Any 3D Model with Multi-view Video Diffusion

Yanqin Jiang, Chaohui Yu, Chenjie Cao, Fan Wang, Weiming Hu, Jin Gao

TL;DR

The paper addresses the problem of animating a static 3D model into coherent 4D content, targeting spatiotemporal inconsistency and the preservation of the object's multi-view appearance. It introduces MV-VDM, a multi-view video diffusion model conditioned on multi-view renderings of the static object, which uses a novel spatiotemporal attention module and an MV2V-Adapter and is trained on MV-Video, a dataset of 115K animations. A two-stage pipeline then reconstructs motion with 4D Gaussian Splatting and refines it with 4D-SDS, which enables mesh animation without rigging. Compared with state-of-the-art methods, the experiments show strong improvements in alignment with the input object, motion quality, and appearance, making high-quality dynamic 3D content easier to obtain.

Abstract

Recent advances in 4D generation mainly focus on generating 4D content by distilling pre-trained text-conditioned or single-view image-conditioned models. These methods cannot conveniently take advantage of the many off-the-shelf 3D assets with multi-view attributes, and their results suffer from spatiotemporal inconsistency owing to the inherent ambiguity in the supervision signals. In this work, we present Animate3D, a novel framework for animating any static 3D model. The core idea is two-fold: 1) We propose a novel multi-view video diffusion model (MV-VDM) conditioned on multi-view renderings of the static 3D object, which is trained on the large-scale multi-view video dataset we present (MV-Video). 2) Based on MV-VDM, we introduce a framework combining reconstruction and 4D Score Distillation Sampling (4D-SDS) to leverage the multi-view video diffusion priors for animating 3D objects. Specifically, for MV-VDM, we design a new spatiotemporal attention module to enhance spatial and temporal consistency by integrating 3D and video diffusion models. Additionally, we leverage the static 3D model's multi-view renderings as conditions to preserve its identity. For animating 3D models, an effective two-stage pipeline is proposed: we first reconstruct motions directly from generated multi-view videos, and then apply the introduced 4D-SDS to refine both appearance and motion. Benefiting from accurate motion learning, we can achieve straightforward mesh animation. Qualitative and quantitative experiments demonstrate that Animate3D significantly outperforms previous approaches. Data, code, and models will be openly released.
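
For readers unfamiliar with score distillation, the 4D-SDS stage named in the abstract builds on Score Distillation Sampling. The block below is only the standard single-image SDS gradient as commonly written since DreamFusion, not the paper's 4D-SDS objective. All symbols are notation assumed here rather than taken from this page: θ are the parameters of the 3D representation, g is a differentiable renderer, ε_φ is the pre-trained diffusion model's noise prediction, y is the conditioning signal, w(t) is a timestep weighting, and α_t, σ_t are the noise-schedule coefficients.

```latex
% Standard SDS gradient (assumed notation; not the paper's 4D-SDS formulation)
\begin{align}
  x &= g(\theta), \qquad
  x_t = \alpha_t\, x + \sigma_t\, \epsilon, \qquad
  \epsilon \sim \mathcal{N}(0, I), \\
  \nabla_\theta \mathcal{L}_{\mathrm{SDS}}
    &= \mathbb{E}_{t,\,\epsilon}\!\left[
         w(t)\,\bigl(\epsilon_\phi(x_t;\, y,\, t) - \epsilon\bigr)\,
         \frac{\partial x}{\partial \theta}
       \right].
\end{align}
```

In the setting the abstract describes, x would presumably be multi-view videos rendered from the animated 3D model, and the prior would be MV-VDM conditioned on the static model's multi-view renderings. The exact 4D-SDS objective is defined in the paper body and is not reproduced here.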

Paper Structure

This paper contains 23 sections, 11 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Different supervision for 4D generation. MV-VDM shows better spatiotemporal consistency than previous models. Based on MV-VDM, we propose Animate3D to animate any 3D model.
  • Figure 2: Illustration of our proposed multi-view video diffusion model, MV-VDM (upper part), and our Animate3D framework (lower part). MV-VDM, trained on our large-scale 4D dataset MV-Video, can generate spatiotemporally consistent multi-view videos. Animate3D, based on MV-VDM, combines reconstruction and 4D-SDS optimization to animate any static 3D model.
  • Figure 3: Qualitative comparison with state-of-the-art methods. Best viewed by zooming in.
  • Figure 4: Ablation for multi-view video diffusion.
  • Figure 5: Ablation for 3D object animation. Best viewed by zooming in.
  • ...and 7 more figures