Puppeteer: Rig and Animate Your 3D Models

Chaoyue Song, Xiu Li, Fan Yang, Zhongcong Xu, Jiacheng Wei, Fayao Liu, Jiashi Feng, Guosheng Lin, Jianfeng Zhang

TL;DR

Puppeteer tackles the bottleneck of converting static 3D models into animated assets by introducing a unified rigging-and-animation pipeline. It combines a shape-conditioned, auto-regressive skeleton generator with a topology-aware skinning predictor and a differentiable, video-guided animation module, trained on an expanded Articulation-XL2.0 dataset. The approach achieves state-of-the-art performance in skeleton fidelity, skinning accuracy, and animation stability across diverse asset categories, while offering efficient inference and robust generalization to AI-generated content. This work paves the way for end-to-end automated 3D animation that spans game assets, films, and interactive media, reducing manual effort and enabling broader creator access.

Abstract

Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention. We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures via an auto-regressive transformer that introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation that enhances bidirectional learning capabilities. It then infers skinning weights via an attention-based architecture incorporating topology-aware joint attention that explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, ranging from professionally designed game assets to AI-generated shapes, producing temporally coherent animations that eliminate the jittering issues common in existing methods.
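
The topology-aware joint attention mentioned above relies on skeletal graph distances between joints. The abstract does not specify how these distances enter the attention computation, so the following is only a minimal sketch of the distance computation itself, assuming the skeleton is stored as a parent-index array (a common convention, not confirmed by the source). The function name `joint_graph_distances` and the toy skeleton are hypothetical.

```python
from collections import deque


def joint_graph_distances(parents):
    """Return all-pairs hop distances between joints of a skeleton tree.

    Assumption: parents[i] is the index of joint i's parent, -1 for the root.
    """
    n = len(parents)
    # Build an undirected adjacency list from the parent array.
    adj = [[] for _ in range(n)]
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child].append(parent)
            adj[parent].append(child)

    # Breadth-first search from every joint; -1 marks "not yet visited".
    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] < 0:
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


# Toy skeleton (hypothetical): root 0 has children 1 and 2; joint 3 hangs off joint 1.
parents = [-1, 0, 0, 1]
distances = joint_graph_distances(parents)
```

A matrix like `distances` could, for example, index a learned bias added to the joint-to-joint attention logits; that usage is a guess at one plausible design rather than a description of the paper's architecture.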

Paper Structure

This paper contains 24 sections, 8 equations, 13 figures, 10 tables.

Figures (13)

  • Figure 1: Given a 3D model, we apply automatic rigging to create a skeleton structure with skinning weights. The input mesh is then rendered as input for video generation models [klingai, jimengai2025]. Finally, we produce animations guided by the generated videos. The input 3D models are generated by Hunyuan3D 2.0 [hunyuan3d22025tencent].
  • Figure 2: Overview of our automatic rigging pipeline. Given a 3D mesh, we first sample point clouds with normals, then generate a skeleton using an auto-regressive transformer. The point clouds and skeleton are processed through an attention-based network with four key operations: bone feature enhancement via topology-aware joint attention, global context integration through cross-attention with shape latents, bone-point interaction via cross-attention, and point feature refinement. Finally, cosine similarity and softmax normalization produce the skinning weights.
  • Figure 3: Qualitative skeleton generation results. The data is from Articulation-XL2.0, ModelsResource, and the diverse-pose subset from top to bottom.
  • Figure 4: Comparison of skeleton results on generated meshes. The meshes are generated by Tripo 2.0 [tripo3d] and Hunyuan3D 2.0 [hunyuan3d22025tencent].
  • Figure 5: Qualitative skinning weight prediction results. The data is from Articulation-XL2.0, ModelsResource, and the diverse-pose subset from top to bottom. Each example shows the predicted weight visualization alongside its L1 error map. Additional results are provided in the appendix.
  • ...and 8 more figures
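
The final step in the Figure 2 caption, cosine similarity followed by softmax normalization, can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the authors' implementation: the function name `skinning_weights`, the feature dimensions, and the `temperature` parameter are all hypothetical, and the features are random stand-ins for the network's refined point and bone features.

```python
import numpy as np


def skinning_weights(point_feats, bone_feats, temperature=1.0):
    """Map per-point and per-bone features to skinning weights.

    point_feats: (P, D) array, bone_feats: (B, D) array.
    Returns a (P, B) array whose rows are non-negative and sum to 1.
    `temperature` is an assumed knob; the source does not mention one.
    """
    # L2-normalize so the dot product equals cosine similarity.
    p_norm = np.maximum(np.linalg.norm(point_feats, axis=1, keepdims=True), 1e-8)
    b_norm = np.maximum(np.linalg.norm(bone_feats, axis=1, keepdims=True), 1e-8)
    sim = (point_feats / p_norm) @ (bone_feats / b_norm).T  # (P, B), values in [-1, 1]

    # Numerically stable softmax over the bone axis.
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)


# Toy data (hypothetical): 3 bones, 5 points, 8-dimensional features.
rng = np.random.default_rng(0)
bone_feats = rng.normal(size=(3, 8))
# The first point is a scaled copy of bone 1's feature, so it should bind to bone 1.
point_feats = np.vstack([bone_feats[1] * 2.0, rng.normal(size=(4, 8))])
weights = skinning_weights(point_feats, bone_feats, temperature=0.1)
```

Because cosine similarity is bounded in [-1, 1], a plain softmax over it yields fairly flat weights; the assumed temperature below 1 sharpens the distribution so each point is dominated by a few bones.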