SMF: Template-free and Rig-free Animation Transfer using Kinetic Codes

Sanjeev Muralikrishnan, Niladri Shekhar Dutt, Niloy J. Mitra

TL;DR

SMF tackles the challenge of transferring coarse motion signals to dense 3D character meshes without relying on templates or deformation rigs. It introduces Kinetic Codes, a temporally-aware latent space learned from sparse motion via a multi-headed attention autoencoder, and couples this with spatial and temporal gradient predictors and a differentiable Poisson solver to produce temporally coherent mesh sequences from a rest shape $X_0$. Temporal coherence is further enforced by an Augmented Neural ODE that predicts corrective Jacobians over motion windows, enabling robust long-sequence animation. Across AMASS, Mixamo, D4D, and monocular video, SMF demonstrates strong generalization to unseen motions and shapes, achieving state-of-the-art results on AMASS and showing realistic transfers to stylized and non-human characters, with potential for real-time applications.

Abstract

Animation retargeting applies a sparse motion description (e.g., keypoint sequences) to a character mesh to produce a semantically plausible and temporally coherent full-body mesh sequence. Existing approaches come with restrictions: they require access to template-based shape priors or artist-designed deformation rigs, suffer from limited generalization to unseen motion and/or shapes, or exhibit motion jitter. We propose Self-supervised Motion Fields (SMF), a self-supervised framework that is trained with only sparse motion representations, without requiring dataset-specific annotations, templates, or rigs. At the heart of our method are Kinetic Codes, a novel autoencoder-based sparse motion encoding that exposes a semantically rich latent space, simplifying large-scale training. Our architecture comprises dedicated spatial and temporal gradient predictors, which are jointly trained in an end-to-end fashion. The combined network, regularized by the Kinetic Codes' latent space, generalizes well across both unseen shapes and new motions. We evaluated our method on unseen motion sampled from AMASS, D4D, Mixamo, and raw monocular video for animation transfer on various characters with varying shapes and topology. We report a new SoTA on the AMASS dataset in the context of generalization to unseen motion. Code, weights, and supplementary material are available on the project webpage at https://motionfields.github.io/
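
To make the role of the Poisson solve concrete, the following is a minimal sketch of how per-triangle Jacobians can be turned back into vertex positions by an area-weighted least-squares fit. It is an illustration under stated assumptions, not the authors' implementation: it uses a flat 2D triangle mesh instead of a 3D surface, dense NumPy instead of a differentiable sparse solver, and the names `gradient_operator`, `poisson_solve`, and `anchor` are hypothetical.

```python
import numpy as np


def gradient_operator(verts, faces):
    """Stack per-triangle gradient operators into a (2m, n) matrix.

    Assumption: a planar 2D mesh, so each triangle has a 2x2 Jacobian.
    Returns the operator and the triangle areas.
    """
    n, m = len(verts), len(faces)
    G = np.zeros((2 * m, n))
    areas = np.zeros(m)
    # Maps the three corner values to the two edge differences.
    D = np.array([[-1.0, 1.0, 0.0], [-1.0, 0.0, 1.0]])
    for t, (i, j, k) in enumerate(faces):
        # Columns are the two rest-pose edges leaving vertex i.
        E = np.column_stack([verts[j] - verts[i], verts[k] - verts[i]])
        G[2 * t:2 * t + 2, [i, j, k]] = np.linalg.inv(E).T @ D
        areas[t] = 0.5 * abs(np.linalg.det(E))
    return G, areas


def poisson_solve(verts, faces, jacobians, anchor=0):
    """Find positions whose per-triangle gradients best match `jacobians`.

    `jacobians` holds one 2x2 matrix per triangle. The solution is only
    defined up to translation, so the `anchor` vertex is kept at its
    rest position (a hypothetical convention chosen for this sketch).
    """
    G, areas = gradient_operator(verts, faces)
    w = np.sqrt(np.repeat(areas, 2))[:, None]  # area weights per row
    # Applying G to positions yields the transposed Jacobian per triangle.
    rhs = np.concatenate([J.T for J in jacobians], axis=0)
    X, *_ = np.linalg.lstsq(w * G, w * rhs, rcond=None)
    return X - X[anchor] + verts[anchor]
```

Calling `poisson_solve(verts, faces, jacobians)` with the same affine matrix for every triangle should reproduce that affine map applied to the rest shape, which is a convenient sanity check. When the requested Jacobians are not integrable, the least-squares fit returns the closest consistent mesh, which is the property that makes gradient-domain prediction attractive.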

Paper Structure

This paper contains 18 sections, 13 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Method overview. We present a self-supervised learning setup to transfer sparse motion information, specified in the form of keypoints over time, to target characters producing full-body motion. Top: During training, given a motion dataset we extract sparse keypoints from the meshes and encode them to a novel Kinetic Code representation. We then train two networks to map the rest shape and the Kinetic Code to the full body motion, with only mesh-level reconstruction loss. Bottom: At inference, we drop in stylized characters (Hole Man) and unseen motion inputs to obtain full-body character animation.
  • Figure 2: Windowed Jacobian Prediction. We use attention encodings of the current window's posed Jacobians (Eq. \ref{eq:deformjac}) and the previous window's augmented corrective Jacobians (Eq. \ref{eq:funcResidual}) to predict the current window's augmented corrective Jacobians. These are projected to predict the current window's corrective Jacobians (Eq. \ref{eq:finalcorrective}). These correctives are then added to the posed Jacobians to obtain the current window's final Jacobians. We use window size $W=32$.
  • Figure 3: Unseen motion from Out-of-Distribution dataset (Mixamo) applied to in-the-wild shapes. We compare SMF with NJF, TRJ, and Skeleton-free transfer on unseen dance motions (left: Hiphop; right: Shuffle) sampled from the out-of-distribution Mixamo dataset, applied to a 3D character found in-the-wild (hole man, left) and a Mixamo character (zombie, right). We modified NJF and TRJ to use keypoints instead, indicated by the superscript TF. Competing methods exhibit distortion artifacts while attempting to follow the sampled source motion, while SMF (Ours) more accurately follows the sampled motion.
  • Figure 4: Unseen motion applied to in-the-wild shapes. We compare SMF with NJF, TRJ, and skeleton-free transfer on unseen motion (left: Leg Backward Rotation; right: One Leg Jump), applied to in-the-wild 3D characters. Baselines often do not adhere to source motion (circled in blue) or exhibit distortion artifacts (circled in red). Our method transfers motion more accurately with far fewer shape distortion artifacts, while closely following the target motion.
  • Figure 5: Comparison of SMF with baselines. We compare SMF with Neural Jacobian Fields \cite{aigerman2022neural}, Temporal Residual Jacobians \cite{trj}, and template-free skeleton-free transfer \cite{skeletonfree}. We measure the vertex-to-vertex error with ground truth and color-code the results according to the measured error. Darker red indicates higher error. SMF accurately transfers the motion to the target mesh, while baselines struggle to follow the input motion and exhibit distortion artifacts.
  • ...and 6 more figures
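
The windowed scheme described in the Figure 2 caption can be sketched as a simple loop: split the sequence into windows of $W=32$ frames, predict correctives for each window from its posed Jacobians and the state carried over from the previous window, and add them to the posed Jacobians. The sketch below only illustrates this data flow. The predictor `placeholder_corrector` is a hypothetical stand-in, not the paper's attention encoder or Augmented Neural ODE, and the array layout `(frames, faces, 3, 3)` is an assumption.

```python
import numpy as np

W = 32  # window size stated in the Figure 2 caption


def placeholder_corrector(posed_window, prev_aug):
    """Hypothetical stand-in for the learned corrective predictor.

    Takes the current window's posed Jacobians and the previous window's
    augmented state, and returns (correctives, new augmented state).
    The arithmetic here is arbitrary and only keeps the shapes right.
    """
    aug = 0.5 * prev_aug + 0.01 * posed_window.mean(axis=0)
    corrective = np.broadcast_to(aug, posed_window.shape)
    return corrective, aug


def windowed_jacobians(posed, corrector, window=W):
    """Add per-window correctives to posed Jacobians.

    posed: array of shape (T, F, 3, 3), one 3x3 Jacobian per face and
    frame (an assumed layout). The last window may be shorter than
    `window` when T is not a multiple of it.
    """
    prev_aug = np.zeros(posed.shape[1:])  # no history before window 0
    out = []
    for start in range(0, len(posed), window):
        chunk = posed[start:start + window]
        corrective, prev_aug = corrector(chunk, prev_aug)
        out.append(chunk + corrective)  # final = posed + corrective
    return np.concatenate(out, axis=0)
```

Because each window receives only the previous window's state, memory stays bounded for long sequences while information still propagates forward in time, which is the motivation the caption gives for carrying the augmented correctives across windows.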