UniMoGen: Universal Motion Generation

Aliasghar Khani, Arianna Rampini, Evan Atherton, Bruno Roy

TL;DR

UniMoGen tackles the challenge of skeleton-agnostic motion generation by introducing a UNet-based diffusion model that processes variable joint counts without padding. It supports auto-regressive generation conditioned on style, trajectory, and past frames, using temporal downsampling, joint-wise attention with topology-aware masking, and FiLM conditioning, enabling real-time synthesis across diverse skeletons. Across 100style and the combined 100style+LAFAN1 datasets, UniMoGen outperforms state-of-the-art diffusion and skeleton-agnostic baselines in realism, diversity, and physical plausibility, while significantly reducing computational overhead. The work demonstrates accurate trajectory following, smooth long-horizon motion, and effective style blending, highlighting its potential for versatile animation, gaming, and robotics applications.

Abstract

Motion generation is a cornerstone of computer graphics, animation, gaming, and robotics, enabling the creation of realistic and varied character movements. A significant limitation of existing methods is their reliance on specific skeletal structures, which restricts their versatility across different characters. To overcome this, we introduce UniMoGen, a novel UNet-based diffusion model designed for skeleton-agnostic motion generation. UniMoGen can be trained on motion data from diverse characters, such as humans and animals, without the need for a predefined maximum number of joints. By dynamically processing only the necessary joints for each character, our model achieves both skeleton agnosticism and computational efficiency. Key features of UniMoGen include controllability via style and trajectory inputs, and the ability to continue motions from past frames. We demonstrate UniMoGen's effectiveness on the 100style dataset, where it outperforms state-of-the-art methods in diverse character motion generation. Furthermore, when trained on both the 100style and LAFAN1 datasets, which use different skeletons, UniMoGen achieves high performance and improved efficiency across both skeletons. These results highlight UniMoGen's potential to advance motion generation by providing a flexible, efficient, and controllable solution for a wide range of character animations.
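
As a rough illustration of what "processing only the necessary joints" could look like, here is a minimal NumPy sketch of joint-wise self-attention with a topology-aware mask. It is not the authors' implementation. The helper names `topology_mask` and `joint_attention`, the `max_hops` parameter, and the rule "attend only to joints within a few edges in the kinematic tree" are all assumptions made for illustration; the summary above only states that joint-wise attention uses topology-aware masking. Learned query/key/value projections are omitted for brevity.

```python
import numpy as np

def topology_mask(parents, max_hops=2):
    """Boolean (J, J) mask that is True where two joints lie within
    `max_hops` edges of each other in the kinematic tree.
    `parents[i]` is the parent index of joint i, or -1 for the root.
    (Hypothetical masking rule, chosen for illustration.)"""
    n = len(parents)
    adj = np.eye(n, dtype=bool)  # every joint can attend to itself
    for child, parent in enumerate(parents):
        if parent >= 0:
            adj[child, parent] = adj[parent, child] = True
    reach = adj.copy()
    for _ in range(max_hops - 1):
        # One more hop: compose current reachability with direct adjacency.
        reach = (reach.astype(int) @ adj.astype(int)) > 0
    return reach

def joint_attention(x, mask):
    """Single-head self-attention over the joint axis.
    x: (J, D) per-joint features; mask: (J, J) boolean.
    Uses identity projections instead of learned Q/K/V (a simplification)."""
    d = x.shape[-1]
    scores = (x @ x.T) / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # block disallowed pairs
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Two skeletons with different joint counts go through the same function;
# neither is padded to a shared maximum.
chain_parents = [-1, 0, 1, 2, 3]   # 5-joint chain (toy skeleton)
fork_parents = [-1, 0, 0]          # 3-joint fork (toy skeleton)
feats = np.random.default_rng(0).normal(size=(5, 8))
out_chain, w_chain = joint_attention(feats, topology_mask(chain_parents))
out_fork, w_fork = joint_attention(feats[:3], topology_mask(fork_parents, 1))
```

Because the mask is built from each skeleton's own parent array, the attention matrix is exactly J×J for that character, which is one way to avoid a predefined maximum joint count.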

Paper Structure

This paper contains 17 sections, 1 equation, 4 figures, 9 tables.

Figures (4)

  • Figure 1: Overview of the UniMoGen denoising architecture. During training, the model receives style index $S$, past motion inputs as root positions $P_p$ and joint rotations $P_r$, trajectory $(T_p, T_r)$, and diffusion time step $t$. Dedicated modules process each input, and their representations are fused in a UNet-based diffusion network. The network leverages temporal and joint-level self-attention, cross-attention to inject trajectory information, and Feature-wise Linear Modulation (FiLM) to condition on time and style. The model outputs future motion $(C_p, C_r)$, enabling controllable, skeleton-agnostic generation across diverse characters. As illustrated in the figure, we omit attention modules in the first and last layers of the UNet and apply them only to the downsampled layers to reduce memory consumption.
  • Figure 2: Style blending with UniMoGen. Visualization of motions generated by blending two styles: Aeroplane and Arms Above Head. Purple shows $100\%$ Aeroplane and $0\%$ Arms Above Head, green shows a blend of $35\%$ Aeroplane and $65\%$ Arms Above Head, and orange shows $0\%$ Aeroplane and $100\%$ Arms Above Head. The smooth transition illustrates the expressive and continuous nature of the learned style space.
  • Figure 3: Onion skinning visualization of UniMoGen and CAMDM results. The top and bottom figures compare motion outputs from UniMoGen and CAMDM, given the same past frames, style, and trajectory. As shown, our model exhibits noticeably less foot sliding and penetration. These issues are highlighted with ellipses in the CAMDM results for clarity.
  • Figure 4: Multi-Skeleton Generation. Left: a motion generated for the LAFAN1 skeleton. Right: a motion generated for the 100Style skeleton. Both motions are generated by the same model, which is trained on the combination of the two datasets.
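
The FiLM conditioning named in Figure 1 and the style blending shown in Figure 2 can be sketched together in a few lines of NumPy. This is a minimal sketch under stated assumptions, not the authors' code: FiLM itself is the standard per-channel scale and shift, but the function names `film` and `blend_styles`, the linear projections `w_gamma` and `w_beta`, the embedding table, and the idea that blending is a convex combination of two style embeddings are all hypothetical. The captions do not say how the blend is computed.

```python
import numpy as np

rng = np.random.default_rng(0)

def film(features, cond, w_gamma, w_beta):
    """Feature-wise Linear Modulation: gamma * features + beta, where
    gamma and beta are predicted from the conditioning vector.
    features: (T, J, C); cond: (E,); w_gamma, w_beta: (E, C).
    The linear projections are an illustrative assumption."""
    gamma = cond @ w_gamma          # (C,) per-channel scale
    beta = cond @ w_beta            # (C,) per-channel shift
    return gamma * features + beta  # broadcasts over frames and joints

def blend_styles(style_table, idx_a, idx_b, weight_a):
    """Convex combination of two style embeddings (assumed blending rule)."""
    return weight_a * style_table[idx_a] + (1.0 - weight_a) * style_table[idx_b]

# Toy sizes: 100 styles, 16-dim embeddings, 8 frames, 5 joints, 32 channels.
style_table = rng.normal(size=(100, 16))
w_gamma = rng.normal(size=(16, 32))
w_beta = rng.normal(size=(16, 32))
features = rng.normal(size=(8, 5, 32))

# 35% of style 0 and 65% of style 1, mirroring the ratio in Figure 2.
cond = blend_styles(style_table, 0, 1, weight_a=0.35)
modulated = film(features, cond, w_gamma, w_beta)
```

Since the modulation broadcasts over the frame and joint axes, the same conditioning path applies unchanged to skeletons with different joint counts.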