AnimateAnyMesh: A Feed-Forward 4D Foundation Model for Text-Driven Universal Mesh Animation

Zijie Wu, Chaohui Yu, Fan Wang, Xiang Bai

TL;DR

AnimateAnyMesh addresses text-driven animation of arbitrary 3D meshes with a feed-forward architecture that avoids per-scene optimization. It combines DyMeshVAE, a topology-aware VAE that compresses dynamic meshes into a latent space, with a Shape-Guided Text-to-Trajectory model trained via Rectified Flow to map prompts to vertex trajectories. The DyMesh Dataset (over 4M sequences with text annotations) supports training and evaluation. Experiments show semantically aligned, temporally coherent mesh animations generated in a few seconds, significantly outperforming prior methods in both quality and efficiency. The authors state that all data, code, and models will be released openly.

Abstract

Recent advances in 4D content generation have attracted increasing attention, yet creating high-quality animated 3D models remains challenging due to the complexity of modeling spatio-temporal distributions and the scarcity of 4D training data. In this paper, we present AnimateAnyMesh, the first feed-forward framework that enables efficient text-driven animation of arbitrary 3D meshes. Our approach leverages a novel DyMeshVAE architecture that effectively compresses and reconstructs dynamic mesh sequences by disentangling spatial and temporal features while preserving local topological structures. To enable high-quality text-conditional generation, we employ a Rectified Flow-based training strategy in the compressed latent space. Additionally, we contribute the DyMesh Dataset, containing over 4M diverse dynamic mesh sequences with text annotations. Experimental results demonstrate that our method generates semantically accurate and temporally coherent mesh animations in a few seconds, significantly outperforming existing approaches in both quality and efficiency. Our work marks a substantial step forward in making 4D content creation more accessible and practical. All data, code, and models will be publicly released.
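
The abstract names Rectified Flow as the training strategy in the latent space. As a rough illustration only, the sketch below shows the generic Rectified Flow objective in NumPy: a straight-line interpolation between a noise sample and a data sample, a constant velocity target, and Euler integration at sampling time. The function names, the convention that noise sits at t = 0 and data at t = 1, and the `velocity_fn` callable (a stand-in for the text- and shape-conditioned network) are assumptions made for this example, not details taken from the paper.

```python
import numpy as np

def rectified_flow_pair(x0, x1, t):
    """Build the interpolated point x_t and the velocity target.

    x0: noise samples, x1: data (latent) samples, both shaped (B, ...).
    t:  per-sample times in [0, 1], shaped (B,).
    Assumed convention: t = 0 is noise, t = 1 is data.
    """
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))  # broadcast t over feature dims
    x_t = (1.0 - t) * x0 + t * x1              # point on the straight path
    v_target = x1 - x0                         # constant velocity along that path
    return x_t, v_target

def rectified_flow_loss(velocity_fn, x0, x1, t):
    """Mean squared error between predicted and target velocity."""
    x_t, v_target = rectified_flow_pair(x0, x1, t)
    v_pred = velocity_fn(x_t, t)               # hypothetical network call
    return float(np.mean((v_pred - v_target) ** 2))

def euler_sample(velocity_fn, x0, num_steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = np.full(x.shape[0], i * dt)
        x = x + dt * velocity_fn(x, t)
    return x
```

With a velocity function that returns exactly `x1 - x0`, the loss is zero and `euler_sample` moves the noise sample onto the data sample; a trained model only approximates this behaviour.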

Paper Structure

This paper contains 21 sections, 14 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: We present AnimateAnyMesh: the first feed-forward universal mesh animation framework that enables efficient motion generation for arbitrary 3D meshes. Given a static mesh and prompt, our method generates high-quality animations in only a few seconds.
  • Figure 2: Illustration of the proposed DyMeshVAE. Given a dynamic mesh $D$, we first extract the initial-frame vertices $V_0$, the connectivity information from the faces $F$, and the relative trajectories $V_T$. This information is then encoded into a decoupled latent space $\{\overline{V_0^n}, \widehat{V_T^n}\}$ by the Encoder, which features trajectory decomposition and topology-aware attention mechanisms. The Decoder then reconstructs the relative trajectories $V^{rec}_T$ from the latent space. Finally, we add $V^{rec}_T$ to $V_0$ to obtain the reconstructed dynamic mesh.
  • Figure 3: Demonstration of divergent trajectories for nearby mesh vertices in the initial frame.
  • Figure 4: The architecture of the Shape-Guided Text-to-Trajectory Model. DVAE stands for the proposed DyMeshVAE.
  • Figure 5: Animation examples of AnimateAnyMesh. Our model demonstrates the capability to generate high-quality and semantically plausible mesh animations for arbitrary input meshes based on text prompts. Best viewed when zoomed in.
  • ...and 8 more figures
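
The Figure 2 caption describes splitting a dynamic mesh into initial-frame vertices, face connectivity, and relative trajectories, and then adding the reconstructed trajectories back to $V_0$. The NumPy sketch below illustrates only that input/output bookkeeping, not the encoder or decoder. It assumes the mesh is stored as a `(T+1, N, 3)` vertex array with a fixed triangular face list, that "relative" means an offset from the initial frame, and that connectivity is represented as a boolean vertex adjacency matrix; all function names are hypothetical.

```python
import numpy as np

def decompose_dynamic_mesh(vertices):
    """Split a (T+1, N, 3) vertex sequence into V_0 and relative trajectories.

    Assumption: relative trajectories are per-frame offsets from frame 0.
    """
    v0 = vertices[0]                    # (N, 3) initial-frame vertices
    rel_traj = vertices[1:] - v0[None]  # (T, N, 3) offsets from V_0
    return v0, rel_traj

def reconstruct_dynamic_mesh(v0, rel_traj_rec):
    """Add (reconstructed) relative trajectories back to V_0."""
    frames = v0[None] + rel_traj_rec
    return np.concatenate([v0[None], frames], axis=0)  # (T+1, N, 3)

def vertex_adjacency(faces, num_vertices):
    """Boolean (N, N) adjacency from triangular faces (F, 3), with self-loops.

    One possible way to expose face connectivity, e.g. as an attention mask;
    this use is an assumption of the sketch.
    """
    adj = np.eye(num_vertices, dtype=bool)
    for a, b in ((0, 1), (1, 2), (2, 0)):   # the three edges of each triangle
        adj[faces[:, a], faces[:, b]] = True
        adj[faces[:, b], faces[:, a]] = True
    return adj
```

Because the decomposition is a plain subtraction, passing the exact relative trajectories to `reconstruct_dynamic_mesh` returns the original sequence; any reconstruction error in a real pipeline would come from the learned encoder and decoder in between.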