Ponymation: Learning Articulated 3D Animal Motions from Unlabeled Online Videos
Keqiang Sun, Dor Litvak, Yunzhi Zhang, Hongsheng Li, Jiajun Wu, Shangzhe Wu
TL;DR
This work learns a generative model of articulated 3D animal motions from unlabeled online videos, without pose annotations or template shapes. It introduces a video photo-geometric auto-encoding framework in which spatio-temporal transformers encode each video clip into the latent space of a motion VAE, and a decoder maps the latent code to per-frame bone poses that a differentiable renderer turns back into images, so the whole pipeline trains end to end. A category-specific SDF-based base shape, combined with a DINO-ViT-conditioned deformation, enables cross-instance registration, while semantic correspondences and a staged training regime make it possible to learn motion distributions from raw videos. On the AnimalMotion dataset, the method shows compelling qualitative motion generation and outperforms a 4D diffusion baseline on quantitative metrics. It can generate 4D animations automatically from a single image and applies across species, with applications in entertainment and scientific visualization.
Abstract
We introduce a new method for learning a generative model of articulated 3D animal motions from raw, unlabeled online videos. Unlike existing approaches for 3D motion synthesis, our model requires no pose annotations or parametric shape models for training; it learns purely from a collection of unlabeled web video clips, leveraging semantic correspondences distilled from self-supervised image features. At the core of our method is a video Photo-Geometric Auto-Encoding framework that decomposes each training video clip into a set of explicit geometric and photometric representations, including a rest-pose 3D shape, an articulated pose sequence, and texture, with the objective of re-rendering the input video via a differentiable renderer. This decomposition allows us to learn a generative model over the underlying articulated pose sequences akin to a Variational Auto-Encoding (VAE) formulation, but without requiring any external pose annotations. At inference time, we can generate new motion sequences by sampling from the learned motion VAE, and create plausible 4D animations of an animal automatically within seconds given a single input image.
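The VAE-style formulation mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the spatio-temporal transformer encoder, the differentiable renderer, and the image reconstruction loss are omitted, and every name here (`reparameterize`, `kl_to_standard_normal`, `decode_pose_sequence`, `sample_motion`, the linear decoder weights `W`, and the Euler-angle pose layout) is a hypothetical stand-in chosen for illustration. The sketch shows only the two generic VAE ingredients that the abstract relies on: the KL regularizer on the motion latent during training, and sampling from the prior to generate a new pose sequence at inference time.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(N(mu, sigma^2) || N(0, I)), summed over the latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def decode_pose_sequence(z, W, num_frames, num_bones):
    """Hypothetical linear decoder: latent code -> (frames, bones, 3) rotations.

    A real motion decoder would be a learned network; a single matrix
    multiply is used here only to keep the example self-contained.
    """
    return (z @ W).reshape(num_frames, num_bones, 3)

def sample_motion(W, latent_dim, num_frames, num_bones, rng):
    """Inference-time generation: sample z from the prior N(0, I), then decode."""
    z = rng.standard_normal(latent_dim)
    return decode_pose_sequence(z, W, num_frames, num_bones)
```

For example, with `rng = np.random.default_rng(0)` and a random `W` of shape `(latent_dim, num_frames * num_bones * 3)`, calling `sample_motion` returns one array of per-frame bone rotations. In the method described above, such a pose sequence would then articulate the rest-pose shape before rendering.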
