X-MoGen: Unified Motion Generation across Humans and Animals

Xuan Wang, Kai Ruan, Liyang Qian, Zhizhi Guo, Chang Su, Gaoang Wang

TL;DR

X-MoGen is a unified, text-driven motion generation framework that spans humans and animals. It uses a two-stage pipeline. Stage 1 learns a species-conditioned T-pose prior with a Conditional Graph Variational Autoencoder and encodes motions with an Autoencoder. Stage 2 performs masked motion modeling with a diffusion-based head guided by text, while a Morphological Consistency Module encourages anatomically plausible motion across diverse morphologies. The UniMo4D dataset standardizes skeletal topology across 115 species, which enables joint training and supports generalization to unseen species. In experiments on UniMo4D, X-MoGen achieves state-of-the-art realism, text alignment, and morphological consistency on both seen and unseen species, and it demonstrates universal motion generation and cross-species transformation. The result is controllable motion generation from natural language prompts across a broad range of species, with applications in animation and simulation.

Abstract

Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.
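The abstract mentions a morphological loss and a morphological consistency module but does not give their form here. As a rough illustration only, one common way to express skeletal plausibility is to penalize bone lengths that drift away from the species' T-pose. The sketch below assumes that formulation; the function names `bone_lengths` and `morphology_penalty`, the parent-index skeleton encoding, and the squared-error penalty are all hypothetical and are not taken from the paper.

```python
import math

def bone_lengths(joints, parents):
    """Return the length of each bone (child joint to its parent joint)."""
    lengths = []
    for child, parent in enumerate(parents):
        if parent < 0:  # the root joint has no parent, so no bone
            continue
        lengths.append(math.dist(joints[child], joints[parent]))
    return lengths

def morphology_penalty(motion, tpose, parents):
    """Mean squared deviation of per-frame bone lengths from the T-pose.

    Assumed illustrative penalty, not the loss defined in the paper.
    """
    ref = bone_lengths(tpose, parents)
    total, count = 0.0, 0
    for frame in motion:
        for got, want in zip(bone_lengths(frame, parents), ref):
            total += (got - want) ** 2
            count += 1
    return total / count if count else 0.0

# Toy 3-joint chain: root -> joint 1 -> joint 2 (hypothetical skeleton).
parents = [-1, 0, 1]
tpose = [(0.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 2.0, 0.0)]
# Both bones keep length 1, only the pose changes.
rigid_frame = [(1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
# The first bone is stretched from length 1 to length 3.
stretched_frame = [(0.0, 0.0, 0.0), (0.0, 3.0, 0.0), (0.0, 4.0, 0.0)]

rigid_penalty = morphology_penalty([rigid_frame], tpose, parents)
stretched_penalty = morphology_penalty([stretched_frame], tpose, parents)
```

In this toy case a frame that only re-poses the chain incurs no penalty, while a frame that stretches a bone does, which is the kind of signal a morphology-aware training objective would need.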

Paper Structure

This paper contains 26 sections, 12 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: X-MoGen achieves a wide range of capabilities within a single unified framework, including generating both human and animal motions from text descriptions and enabling smooth cross-species motion transitions.
  • Figure 2: Overview of the X-MoGen architecture. Our two-stage framework first learns a T-pose prior and a compact latent motion space via a Conditional Graph Autoencoder (CGAE) and an Autoencoder (AE). In the second stage, a Masked Transformer (M-Trans) conditions a diffusion model to generate motion from noise, guided by the text description and species-specific T-pose priors produced by the CGAE. An auxiliary Morphological Consistency Module (MCM) encourages the generated motions to respect morphological constraints.
  • Figure 3: Statistics of the UniMo4D dataset. (a) Species distribution. (b) Length distribution of key bones.
  • Figure 4: Qualitative results for human and unseen animal motion. Red dashed boxes and red arrows highlight implausible motion artifacts.
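The second stage in Figure 2 relies on masked motion modeling: part of the latent motion sequence is hidden and the model learns to recover it from the visible context and the conditioning signals. The masking step alone can be sketched as follows. This is a minimal sketch under stated assumptions: the helper `mask_latents`, the use of `None` as a mask placeholder, and the fixed mask ratio are hypothetical. The masking schedule, the mask token, and the diffusion head actually used by X-MoGen are not described in this summary.

```python
import random

def mask_latents(latents, mask_ratio, rng):
    """Hide a random fraction of latent frames.

    Returns the masked sequence (hidden entries replaced by None) and a
    parallel list of booleans marking which positions were hidden.
    """
    n = len(latents)
    n_mask = max(1, round(mask_ratio * n)) if n else 0
    hidden = set(rng.sample(range(n), n_mask))
    masked = [None if i in hidden else z for i, z in enumerate(latents)]
    flags = [i in hidden for i in range(n)]
    return masked, flags

# Eight hypothetical latent frames, each a small vector.
latents = [[float(t), float(t) * 0.5] for t in range(8)]
rng = random.Random(0)  # seeded so the example is reproducible
masked, flags = mask_latents(latents, mask_ratio=0.5, rng=rng)

# A model would be trained to predict latents[i] wherever flags[i] is True,
# given the visible frames plus the text and T-pose conditions.
targets = [z for z, hidden in zip(latents, flags) if hidden]
```

Only the hidden positions contribute training targets; the visible frames pass through unchanged and serve as context.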