Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos

Keqiang Sun, Dor Litvak, Yunzhi Zhang, Hongsheng Li, Jiajun Wu, Shangzhe Wu

TL;DR

This work tackles learning a generative model of articulated 3D animal motions from unlabeled online videos, eliminating the need for pose annotations or template shapes. It introduces a video photo-geometric auto-encoding framework that uses spatio-temporal transformers to encode video clips into a motion VAE and decodes into per-frame bone poses, rendered by a differentiable renderer for end-to-end training. A category-specific SDF-based base shape plus a DINO-ViT conditioned deformation enables cross-instance registration, while semantic correspondences and a staged training regime facilitate learning motion distributions from raw videos. Results on the AnimalMotion dataset show compelling qualitative motion generation and superior quantitative metrics against a 4D diffusion baseline, enabling automatic 4D animations from a single image and broad cross-species capability with practical applications in entertainment and scientific visualization.

Abstract

We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for 3D motion synthesis, our model requires no pose annotations or parametric shape models for training; it learns purely from a collection of unlabeled web video clips, leveraging semantic correspondences distilled from self-supervised image features. At the core of our method is a video Photo-Geometric Auto-Encoding framework that decomposes each training video clip into a set of explicit geometric and photometric representations, including a rest-pose 3D shape, an articulated pose sequence, and texture, with the objective of re-rendering the input video via a differentiable renderer. This decomposition allows us to learn a generative model over the underlying articulated pose sequences akin to a Variational Auto-Encoding (VAE) formulation, but without requiring any external pose annotations. At inference time, we can generate new motion sequences by sampling from the learned motion VAE, and create plausible 4D animations of an animal automatically within seconds given a single input image.
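
The VAE-style formulation mentioned above can be illustrated with a minimal sketch. The abstract does not specify the loss terms, weights, or architecture, so everything here is an assumption for illustration: a diagonal-Gaussian latent, a standard-normal prior, and the hypothetical names `reparameterize`, `kl_to_standard_normal`, `training_loss`, and `sample_motion_latent`. The reconstruction error stands in for the image re-rendering loss, which in the actual method would come from the differentiable renderer.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), per latent dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def training_loss(recon_error, mu, log_var, kl_weight=1e-3):
    """Generic VAE objective: reconstruction term plus weighted KL term.

    recon_error is a placeholder for the video re-rendering loss;
    kl_weight is an assumed value, not one taken from the paper.
    """
    return recon_error + kl_weight * kl_to_standard_normal(mu, log_var)


def sample_motion_latent(latent_dim, rng):
    """At inference, draw a latent code from the prior N(0, I)."""
    return [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.0] * 8, [0.0] * 8   # hypothetical encoder outputs
    z_train = reparameterize(mu, log_var, rng)
    z_test = sample_motion_latent(8, rng)
    print(len(z_train), len(z_test), training_loss(1.0, mu, log_var))
```

In this sketch, training draws `z` from the encoder's posterior so that gradients can flow through `mu` and `log_var`, while generation draws `z` from the prior and would pass it to the pose decoder to obtain a new articulated pose sequence.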

Paper Structure

This paper contains 44 sections, 10 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Learning 3D Animal Motions from Unlabeled Online Videos. Given a collection of monocular videos of an animal category sourced from the Internet as training data, our method learns a generative model of the articulated 3D motions together with a monocular 3D reconstruction model, without relying on any shape templates or pose annotations. At inference time, the model generates new 3D motion sequences and turns a single test image into 4D animations fully automatically.
  • Figure 2: Training Pipeline. Our method learns a generative model of articulated 3D motion sequences from a collection of unlabeled monocular videos. During training, the model encodes an input video sequence $I_{1:T}$ into a latent code $z$ in the motion VAE, and decodes from it a sequence of articulated 3D poses $\hat{\xi}_{1:T}$. This pose sequence is used to animate the reconstructed 3D shape, allowing the full pipeline to be trained simply using image reconstruction losses with unsupervised image features and object masks obtained from off-the-shelf models, without any external pose annotations.
  • Figure 3: 3D Motion Generation and Animation. At test time, our model generates plausible 3D motion sequences by sampling from the learned motion VAE. It can also reconstruct articulated 3D shapes from a single 2D image in a feed-forward fashion, and generate 4D animations fully automatically within seconds. Within each gray box on the right, the first row shows the textured animation, and the second row visualizes the corresponding 3D shapes with the generated bone articulations.
  • Figure 4: 4D Generation Comparisons. We compare with 4D-fy (Bahmani et al., 2023), a recent text-to-4D generation method distilling from 2D diffusion. Despite heavy prompt engineering and a lengthy training time (12 hours), 4D-fy still fails to produce noticeable motion, whereas our model generates diverse motion sequences in a feed-forward pass within a few seconds, with much better 3D geometry.
  • Figure 5: 3D Motion Generation Results on More Species. Our method can be trained on various animal species, such as the cows, zebras, and giraffes illustrated here. The model learns to generate plausible 3D motion sequences specific to each species, such as the neck motion in the first example, which is more common in giraffes than in other animals.
  • ...and 4 more figures