AnyTop: Character Animation Diffusion with Any Topology

Inbar Gat, Sigal Raab, Guy Tevet, Yuval Reshef, Amit H. Bermano, Daniel Cohen-Or

TL;DR

AnyTop tackles motion synthesis for arbitrary skeletal topologies using a diffusion model with a transformer-based denoiser conditioned on skeletal structure and joint descriptions. The architecture combines an Enrichment Block and a Skeletal Temporal Transformer with a Topological Conditioning Scheme that injects graph topology into attention, enabling generalization across skeletons, including unseen ones. The model learns an informative latent space, accessed through diffusion features (DIFT), that supports downstream tasks such as joint correspondence, temporal segmentation, and motion editing, demonstrated on the Truebones dataset. The model generalizes with as few as three training examples per topology, and ablations confirm the importance of topology embeddings and per-joint representations.

Abstract

Generating motion for arbitrary skeletons is a longstanding challenge in computer graphics, remaining largely unexplored due to the scarcity of diverse datasets and the irregular nature of the data. In this work, we introduce AnyTop, a diffusion model that generates motions for diverse characters with distinct motion dynamics, using only their skeletal structure as input. Our work features a transformer-based denoising network, tailored for arbitrary skeleton learning, integrating topology information into the traditional attention mechanism. Additionally, by incorporating textual joint descriptions into the latent feature representation, AnyTop learns semantic correspondences between joints across diverse skeletons. Our evaluation demonstrates that AnyTop generalizes well, even with as few as three training examples per topology, and can produce motions for unseen skeletons as well. Furthermore, our model's latent space is highly informative, enabling downstream tasks such as joint correspondence, temporal segmentation and motion editing. Our webpage, https://anytop2025.github.io/Anytop-page, includes links to videos and code.

Paper Structure

This paper contains 48 sections, 6 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Overview. The input to AnyTop is a noised motion $X_t$ and the skeleton $\mathcal{S}=\{\mathcal{P}_{\mathcal{S}}, \mathcal{R}_{\mathcal{S}}, \mathcal{D}_{\mathcal{S}},\mathcal{N}_{\mathcal{S}}\}$, where $\mathcal{P}_{\mathcal{S}}$ refers to the rest-pose, $\mathcal{R}_{\mathcal{S}}$ denotes joints' relations, $\mathcal{D}_{\mathcal{S}}$ defines topological distances between each pair of joints and $\mathcal{N}_{\mathcal{S}}$ denotes joint names. The Enrichment Block incorporates the skeletal features into the noised motion by concatenating the embedded $\mathcal{P}_{\mathcal{S}}$ to the sequence as an additional temporal token, and adding a T5-embedded name to each joint. The enriched motion is then passed through a stack of $L$ Skeletal Temporal Transformer layers. We apply skeletal attention along the joint axis to capture interactions between all joints, and incorporate the topology information $\mathcal{R}_{\mathcal{S}}$ and $\mathcal{D}_{\mathcal{S}}$ into the attention maps. Next, we apply temporal attention along the frame axis. Finally, the output is projected back to the motion feature dimension, facilitating the reconstruction of the motion sequence.
  • Figure 2: Topological Conditions. Joint relations $\mathcal{R}_{\mathcal{S}}$ (top) and graph distances $\mathcal{D}_{\mathcal{S}}$ (bottom), visualized for a specific joint marked in red. Different colors indicate different values in the row corresponding to the visualized joint in the $\mathcal{R}_{\mathcal{S}}, \mathcal{D}_{\mathcal{S}}$ matrices.
  • Figure 3: Spatial Correspondence. Monkey (top left) depicts the reference skeleton, while the fox, scorpion, and bird depict different target skeletons. Target skeleton joints are color-coded to match their corresponding joints in the reference. For better visualization, we color the bones to match their adjacent joints. Note the correspondence in limbs, spine, and tail.
  • Figure 4: Temporal Correspondence. Monkey (top row) features the reference motion, while the Crab and Lynx represent two target motions. The frames of the targets are color-coded to align with their corresponding reference frames. Note the correspondence: aggressive motion segments are pink, idle frames blue, and transitional frames green.
  • Figure 5: In-skeleton Generalization. The top row depicts two ground truth chicken motions: pecking (left) and walking (right). The bottom row presents synthesized motions of an adapted SinMDM (left) and AnyTop (right). The emphasized frames in AnyTop demonstrate spatial composition of walking and pecking, introducing novel poses not present in the ground truth. SinMDM embeds entire poses, hence cannot spatially-compose joints.
  • ...and 5 more figures
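
The captions of Figures 1 and 2 describe two per-skeleton matrices: graph distances $\mathcal{D}_{\mathcal{S}}$ and joint relations $\mathcal{R}_{\mathcal{S}}$. The sketch below shows one way such matrices could be derived from a parent-index array. It is an illustration under stated assumptions, not the authors' implementation: the function names, the parent-array input format, and the relation categories (`self`, `parent`, `child`, `sibling`, `none`) are all hypothetical choices made here.

```python
from collections import deque


def topological_distances(parents):
    """All-pairs hop counts on the skeleton tree, via BFS from every joint.

    `parents[j]` is the index of joint j's parent, or -1 for the root
    (an assumed input format, not taken from the paper).
    """
    n = len(parents)
    # Build an undirected adjacency list from the parent array.
    adj = [[] for _ in range(n)]
    for j, p in enumerate(parents):
        if p >= 0:
            adj[j].append(p)
            adj[p].append(j)

    dist = [[-1] * n for _ in range(n)]
    for src in range(n):
        dist[src][src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src][v] < 0:  # not visited yet
                    dist[src][v] = dist[src][u] + 1
                    queue.append(v)
    return dist


def joint_relations(parents):
    """Label rel[i][j] with how joint j relates to joint i.

    The category set is a hypothetical example for illustration.
    """
    n = len(parents)
    rel = [["none"] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                rel[i][j] = "self"
            elif parents[i] == j:
                rel[i][j] = "parent"   # j is the parent of i
            elif parents[j] == i:
                rel[i][j] = "child"    # j is a child of i
            elif parents[i] >= 0 and parents[i] == parents[j]:
                rel[i][j] = "sibling"  # i and j share a parent
    return rel


# Toy skeleton: 0 = root, 1 = spine, 2 and 3 = two limbs on the spine, 4 = tail.
toy_parents = [-1, 0, 1, 1, 0]
D = topological_distances(toy_parents)
R = joint_relations(toy_parents)
```

Row `D[i]` and row `R[i]` correspond to the per-joint view that Figure 2 visualizes for a single highlighted joint. In a model, each distinct distance or relation value would typically be mapped to a learned embedding before it influences attention; how AnyTop does this is not specified in this section.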