How to Train Your Dragon: Automatic Diffusion-Based Rigging for Characters with Diverse Topologies

Zeqi Gu, Difan Liu, Timothy Langlois, Matthew Fisher, Abe Davis

TL;DR

The paper tackles the challenge of animating 2D characters with diverse skeletal topologies beyond humans by leveraging diffusion models conditioned on a topology-agnostic skeleton and a rich appearance input. It introduces AniDiffusion, a two-stage training framework that first learns broad, topology-general rigging from procedurally generated synthetic data and then rapidly adapts to unseen characters with a few annotated frames. A novel skeleton representation encodes spatial and depth information to support occlusions and layering, while a large AniDiffusion Dataset provides per-frame keypoints for evaluation. Experiments show superior qualitative and quantitative performance compared to state-of-the-art pose-conditioned diffusion methods, plus robust interpolation of non-rigid deformations, enabling flexible, user-friendly rigging for a wide range of cartoon and real content.

Abstract

Recent diffusion-based methods have achieved impressive results on animating images of human subjects. However, most of that success has built on human-specific body pose representations and extensive training with labeled real videos. In this work, we extend the ability of such models to animate images of characters with more diverse skeletal topologies. Given a small number (3-5) of example frames showing the character in different poses with corresponding skeletal information, our model quickly infers a rig for that character that can generate images corresponding to new skeleton poses. We propose a procedural data generation pipeline that efficiently samples training data with diverse topologies on the fly. We use it, along with a novel skeleton representation, to train our model on articulated shapes spanning a large space of textures and topologies. Then during fine-tuning, our model rapidly adapts to unseen target characters and generalizes well to rendering new poses, both for realistic and more stylized cartoon appearances. To better evaluate performance on this novel and challenging task, we create the first 2D video dataset that contains both humanoid and non-humanoid subjects with per-frame keypoint annotations. With extensive experiments, we demonstrate the superior quality of our results. Project page: https://traindragondiffusion.github.io/

Paper Structure

This paper contains 21 sections, 2 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Training Pipeline. (a) Our model takes in an appearance reference image and a skeleton image as inputs. (b) For the first training stage, these are randomly generated through our data pipeline. With almost infinite possible combinations of texture, shape, and topology, our synthetic dataset is more challenging than any real-life dataset, which forces our model to learn the correct binding and deformations. Our skeleton representation for this wide range of topologies is also unique: in the red and green channels of this RGB image, we color pixels according to their $x$ and $y$ coordinates. When a user specifies a new target pose, this skeleton is transformed accordingly, which means that the pixel values in the target skeleton image now refer to source coordinates in the starting rest pose. We use the blue channel to embed the layer ordering of each part of the body, which is crucial for characters that contain parts at different depths. For each appearance, we train the model on multiple target poses and layer orderings, as shown in the two dashed boxes in (b). When the new pose causes occlusions, as in the two left columns, the supervising ground-truth appearance differs when the order changes. Thus, our model is forced to understand the influence of layer ordering on appearance. For more data examples, please refer to Fig. 2.
  • Figure 2: More Training Data Visualizations. For each canonical appearance, we show one target pose and two layer ordering examples. The red boxes highlight how ordering affects the bone colors in skeleton representations and target appearance.
  • Figure 3: AniDiffusion Dataset. We establish the first 2D animation dataset with accurate keypoint annotations, part segmentations, and alpha masks (see (a), where keypoints are labeled in green). We use Adobe Character Animator to create more than 120 characters (c), each with approx. 100 types of motion (b).
  • Figure 4: Result Visualizations. In the leftmost column, we show the reference image and the only two other fine-tuning frames (in thumbnails). On the right, we show equally spaced consecutive frames from our model outputs. After a 25-minute fine-tuning on only these three frames, our results show impressive identity preservation, motion interpolation quality, and temporal coherency. From cartoons (rows 1-4; rows 3-4 are results on the AniDiffusion dataset) to real-life clips (rows 5-6), our model works on a wide range of contents and styles. Please see our supplemental materials for more examples. Image credits (rows 1, 2, 5, 6): Fisherfield Childcare, Edina Gecse, DAVIS-2017 (Pont-Tuset et al., arXiv 2017), Arianna1 @ Tenor.
  • Figure 5: Qualitative Comparisons. We compare with two top-performing pose-conditioned diffusion methods, Animate Anyone and MagicDance, and one standard editing tool powered by multiple classical deformation algorithms, the Puppet Pin Tool in Adobe After Effects. The bone representations are shown to the left of the matching results, and are the same for Animate Anyone and MagicDance. Image credits (top to bottom): George Girgis, NobleDame @ Tenor.
  • ...and 2 more figures
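
The skeleton representation described in the Figure 1 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `render_skeleton`, the one-pixel-wide bones, the normalization of coordinates to [0, 1], and the blue-channel value `(layer + 1) / num_layers` are all assumptions made for illustration. The sketch only shows the idea from the caption: pixels drawn at the target pose store the rest-pose $x$ and $y$ of the corresponding bone point in the red and green channels, and the blue channel stores the layer ordering, with front layers drawn last so they overwrite occluded bones.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]         # (x, y) in pixel units
Pixel = Tuple[float, float, float]  # (red, green, blue), each in [0, 1]


def render_skeleton(
    rest_joints: Sequence[Point],
    target_joints: Sequence[Point],
    bones: Sequence[Tuple[int, int]],
    layers: Sequence[int],
    width: int,
    height: int,
    samples: int = 64,
) -> List[List[Pixel]]:
    """Draw bones at their target positions, colored by rest-pose coordinates.

    Hypothetical encoding: red/green hold the normalized rest-pose x/y,
    blue holds (layer + 1) / num_layers, and the background stays black.
    """
    image = [[(0.0, 0.0, 0.0)] * width for _ in range(height)]
    num_layers = max(layers) + 1
    x_scale = max(width - 1, 1)
    y_scale = max(height - 1, 1)

    # Draw back to front so bones on higher layers overwrite lower ones.
    for i in sorted(range(len(bones)), key=lambda k: layers[k]):
        a, b = bones[i]
        blue = (layers[i] + 1) / num_layers
        for s in range(samples + 1):
            t = s / samples
            # The same parameter t picks matching points on the rest bone
            # and on the transformed (target) bone.
            rx = (1 - t) * rest_joints[a][0] + t * rest_joints[b][0]
            ry = (1 - t) * rest_joints[a][1] + t * rest_joints[b][1]
            tx = (1 - t) * target_joints[a][0] + t * target_joints[b][0]
            ty = (1 - t) * target_joints[a][1] + t * target_joints[b][1]
            px, py = round(tx), round(ty)
            if 0 <= px < width and 0 <= py < height:
                image[py][px] = (rx / x_scale, ry / y_scale, blue)
    return image


# Two bones on a 5x5 canvas: a horizontal bone that stays in place and a
# vertical bone that moves from x=0 (rest) to x=2 (target), crossing it.
rest = [(0, 2), (4, 2), (0, 0), (0, 4)]
target = [(0, 2), (4, 2), (2, 0), (2, 4)]
bones = [(0, 1), (2, 3)]

vertical_in_front = render_skeleton(rest, target, bones, [0, 1], 5, 5, samples=4)
horizontal_in_front = render_skeleton(rest, target, bones, [1, 0], 5, 5, samples=4)
```

In this toy example the two renderings differ at the crossing pixel `(2, 2)`: swapping the layer order changes which bone's rest-pose coordinates are visible there, which mirrors the caption's point that the supervising appearance changes with layer ordering under occlusion.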