How to Train Your Dragon: Automatic Diffusion-Based Rigging for Characters with Diverse Topologies
Zeqi Gu, Difan Liu, Timothy Langlois, Matthew Fisher, Abe Davis
TL;DR
The paper tackles the challenge of animating 2D characters whose skeletal topologies go beyond the human body, using diffusion models conditioned on a topology-agnostic skeleton and an appearance input. It introduces AniDiffusion, a two-stage training framework that first learns topology-general rigging from procedurally generated synthetic data and then rapidly adapts to an unseen character from a few (3-5) annotated frames. A novel skeleton representation encodes spatial and depth information to handle occlusions and layering, and the AniDiffusion Dataset provides per-frame keypoint annotations for evaluation. Experiments show better qualitative and quantitative performance than state-of-the-art pose-conditioned diffusion methods, as well as robust interpolation of non-rigid deformations, enabling flexible, user-friendly rigging for a wide range of cartoon and real content.
Abstract
Recent diffusion-based methods have achieved impressive results on animating images of human subjects. However, most of that success has been built on human-specific body pose representations and extensive training with labeled real videos. In this work, we extend the ability of such models to animate images of characters with more diverse skeletal topologies. Given a small number (3-5) of example frames showing the character in different poses with corresponding skeletal information, our model quickly infers a rig for that character that can generate images corresponding to new skeleton poses. We propose a procedural data generation pipeline that efficiently samples training data with diverse topologies on the fly. We use it, along with a novel skeleton representation, to train our model on articulated shapes spanning a large space of textures and topologies. Then, during fine-tuning, our model rapidly adapts to unseen target characters and generalizes well to rendering new poses, for both realistic and more stylized cartoon appearances. To better evaluate performance on this novel and challenging task, we create the first 2D video dataset that contains both humanoid and non-humanoid subjects with per-frame keypoint annotations. With extensive experiments, we demonstrate the superior quality of our results. Project page: https://traindragondiffusion.github.io/
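
To make the idea of sampling diverse skeleton topologies on the fly more concrete, here is a minimal Python sketch. It is not the authors' pipeline. The names `Joint`, `sample_skeleton`, and `bones`, the bone-length range, and the integer `depth` layer per joint are all assumptions made for illustration; the abstract does not specify the actual skeleton representation or how textures are sampled.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Joint:
    parent: int   # index of the parent joint, -1 for the root
    x: float      # 2D position in normalized image coordinates
    y: float
    depth: int    # hypothetical layer index, a stand-in for depth/occlusion order


def sample_skeleton(num_joints, rng, bone_len=(0.05, 0.2), num_layers=3):
    """Sample a random tree-shaped 2D skeleton (illustrative assumption only)."""
    # Start from a root joint at the image center.
    joints = [Joint(-1, 0.5, 0.5, rng.randrange(num_layers))]
    for _ in range(num_joints - 1):
        # Attaching each new joint to a random existing joint yields a
        # random tree, so branching factor and limb count vary per sample.
        parent = rng.randrange(len(joints))
        angle = rng.uniform(0.0, 2.0 * math.pi)
        length = rng.uniform(*bone_len)
        p = joints[parent]
        joints.append(
            Joint(
                parent,
                p.x + length * math.cos(angle),
                p.y + length * math.sin(angle),
                rng.randrange(num_layers),
            )
        )
    return joints


def bones(joints):
    """Return the skeleton's bones as (parent_index, child_index) pairs."""
    return [(j.parent, i) for i, j in enumerate(joints) if j.parent >= 0]


if __name__ == "__main__":
    rng = random.Random(0)
    skeleton = sample_skeleton(8, rng)
    print(len(skeleton), "joints,", len(bones(skeleton)), "bones")
```

Because every non-root joint has exactly one parent, a skeleton with n joints always has n - 1 bones, whatever shape the tree takes. A new pose of the same topology could be produced by re-sampling the bone angles while keeping parents and lengths fixed, which loosely mirrors the paired (skeleton, image) examples the abstract describes.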
