Unimotion: Unifying 3D Human Motion Synthesis and Understanding

Chuqiao Li; Julian Chibane; Yannan He; Naama Pearl; Andreas Geiger; Gerard Pons-Moll

TL;DR

UniMotion is the first unified multi-task human motion model capable of both flexible motion control and frame-level motion understanding. It attains state-of-the-art results for the frame-level text-to-motion task on the established HumanML3D dataset.

Abstract

We introduce Unimotion, the first unified multi-task human motion model capable of both flexible motion control and frame-level motion understanding. While existing works control avatar motion with global text conditioning or with fine-grained per-frame scripts, none can do both at once. In addition, none of the existing works can output frame-level text paired with the generated poses. In contrast, Unimotion allows motion to be controlled with global text, local frame-level text, or both at once, providing more flexible control for users. Importantly, Unimotion is the first model that by design outputs local text paired with the generated poses, allowing users to know what motion happens and when, which is necessary for a wide range of applications. We show that Unimotion opens up new applications: (1) hierarchical control, allowing users to specify motion at different levels of detail; (2) obtaining motion text descriptions for existing MoCap data or YouTube videos; and (3) editability, generating motion from text and editing the motion via text edits. Moreover, Unimotion attains state-of-the-art results for the frame-level text-to-motion task on the established HumanML3D dataset. The pre-trained model and code are available on our project page at https://coral79.github.io/uni-motion/.

Paper Structure (38 sections, 3 equations, 10 figures, 5 tables, 3 algorithms)

Introduction
Related Work
Conditional human motion synthesis.
Text-to-motion generation models.
Human motion understanding.
Preliminary: Motion Diffusion Model
UniMotion: Unifying Motion Synthesis and Understanding
Multi-Modal Motion and Text Diffusion
Temporally aligned Text and Motion Encoding
Data Merging
Experiments
Implementation Details.
Baselines
Evaluation Metrics
Frame-Level Text2Motion Results
...and 23 more sections

Figures (10)

Figure 1: Overview of UniMotion. UniMotion is a transformer-based diffusion model (Model) that can be conditioned on a) human motion, b) CLIP-embedded frame-level text, or c) sequence-level text (Input), on any subset thereof, or on none of them, in which case the missing inputs are replaced with noise. At its core, it diffuses motion and text individually, implemented via separate denoising timesteps $t^x$ and $t^y$. After training with frame-level text losses and motion losses (Loss), see Sec. \ref{subsec:method_mmd}, UniMotion can output clean, noise-free motion and frame-level text descriptions explaining the generated motions (Output).
Figure 2: Text2Motion qualitative results. Columns 1, 3: Local text is the input to our method and the baselines STMC petrovich2024multi (adapted) and FlowMDM barquero2024seamless. Columns 2, 4: Both local and global text are the input to our method and STMC. Our model performs well regardless of the complexity of the local text, in contrast to STMC, which fails to generate the Ginga dance in columns 3 and 4 and performs walking instead. FlowMDM cannot be conditioned on both global and local text.
Figure 3: Motion2Text understanding of MoCap and YouTube data. (a) Given an input MoCap sequence, we use UniMotion to predict frame-level local text. (b) We annotate human motion from YouTube videos with frame-level text. We lift 2D videos to 3D human motion via frame-by-frame pose estimators goel2023humans. We visualize the SMPL human pose (pink) overlaid on the YouTube video frames. Then we run UniMotion to predict frame-level annotations (colored text descriptions below the frames). Such annotations could serve as valuable audio closed captions for the visually impaired.
Figure 4: Joint text and motion generation results. The only input to the models is the global text shown on the left. We compare the motion generated by our method, MDM tevet2023human, and FlowMDM barquero2024seamless. Our method jointly predicts the frame-level labels, so we can annotate sub-sequences, while MDM and FlowMDM can only generate the motion.
Figure 5: Unconditional joint text and motion generation. Our model, by design, generates poses aligned with local text.
...and 5 more figures
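
The separate denoising timesteps $t^x$ and $t^y$ mentioned in the Figure 1 caption can be illustrated with a small forward-noising sketch. This is not the authors' implementation: the linear noise schedule, the number of steps `T`, the feature dimensions, and the helper names `add_noise` and `noise_jointly` are all assumptions made for illustration. It also assumes that a modality used as a condition is kept clean (timestep 0), while a modality to be generated starts from near-pure noise (timestep `T`).

```python
import numpy as np

# Assumed DDPM-style linear schedule; the paper's actual schedule may differ.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # cumulative signal retention per step


def add_noise(clean, t, rng):
    """Forward noising q(z_t | z_0). t == 0 returns the clean input unchanged."""
    if t == 0:
        return clean.copy()
    a = alpha_bar[t - 1]
    noise = rng.standard_normal(clean.shape)
    return np.sqrt(a) * clean + np.sqrt(1.0 - a) * noise


def noise_jointly(motion, frame_text, t_x, t_y, rng):
    """Noise motion and frame-level text with independent timesteps t_x, t_y."""
    return add_noise(motion, t_x, rng), add_noise(frame_text, t_y, rng)


rng = np.random.default_rng(0)
num_frames = 60
# Illustrative shapes only: per-frame motion features and per-frame text embeddings.
motion = rng.standard_normal((num_frames, 263))
frame_text = rng.standard_normal((num_frames, 512))

# Hypothetical mapping from task to (t_x, t_y) under the assumption above.
tasks = {
    "frame-level text -> motion": (T, 0),  # text clean, motion from noise
    "motion -> frame-level text": (0, T),  # motion clean, text from noise
    "joint generation": (T, T),            # both from noise
}

for name, (t_x, t_y) in tasks.items():
    x_t, y_t = noise_jointly(motion, frame_text, t_x, t_y, rng)
    print(name, "| motion kept clean:", np.array_equal(x_t, motion),
          "| text kept clean:", np.array_equal(y_t, frame_text))
```

Decoupling the two timesteps is what lets a single network cover conditional generation in either direction as well as joint generation; a model with one shared timestep would always noise both modalities to the same degree.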
