Human Motion Generation: A Survey

Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, Yizhou Wang

TL;DR

This survey comprehensively maps the landscape of human motion generation, focusing on conditional generation driven by text, audio, and scene context. It delineates motion data representations, collection methods, and a taxonomy of generation techniques, including regression-based and deep generative approaches (GANs, VAEs, normalizing flows, diffusion) with examples across action-to-motion, text-to-motion, music-to-dance, speech-to-gesture, and scene-conditioned generation. It also catalogs widely used datasets and evaluation metrics, highlighting the challenges of evaluating naturalness, diversity, and condition-consistency, and discusses open problems such as data quality, semantic grounding, and controllability. The paper concludes with future directions toward data fusion, richer semantic modeling, principled evaluation, and interactive, multi-user motion generation systems.

Abstract

Human motion generation aims to generate natural human pose sequences and shows immense potential for real-world applications. Substantial progress has been made recently in motion data collection technologies and generation methods, laying the foundation for increasing interest in human motion generation. Most research within this field focuses on generating human motions based on conditional signals, such as text, audio, and scene contexts. While significant advancements have been made in recent years, the task continues to pose challenges due to the intricate nature of human motion and its implicit relationship with conditional signals. In this survey, we present a comprehensive literature review of human motion generation, which, to the best of our knowledge, is the first of its kind in this field. We begin by introducing the background of human motion and generative models, followed by an examination of representative methods for three mainstream sub-tasks: text-conditioned, audio-conditioned, and scene-conditioned human motion generation. Additionally, we provide an overview of common datasets and evaluation metrics. Lastly, we discuss open problems and outline potential future research directions. We hope that this survey could provide the community with a comprehensive glimpse of this rapidly evolving field and inspire novel ideas that address the outstanding challenges.

Paper Structure (36 sections, 6 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: An overview of typical human motion generation approaches. Example images adapted from [chang2017matterport3d, tevet2022human, tseng2022edge, wang2022humanise].
  • Figure 2: Recent advances of human motion generation methods with different conditions.
  • Figure 3: Typical human pose and shape representations with the same pose in (a) 2D keypoints, (b) 3D keypoints, (c) 3D marker keypoints, and (d) rotation-based model.
  • Figure 4: Human motion data collection methods. (a) Examples of marker-based motion capture setups in which (left) optical markers [cmuWEB] or (right) IMUs [zhang2022couch] are attached to the subject's body surface. (b) Example of a markerless multiview motion capture system [HUMBI]. (c) A pseudo-labeling pipeline that uses pose or mesh estimators to generate pseudo labels [pavlakos2019expressive]. (d) Example interface for manual collection using MikuMikuDance (MMD) resources.
  • Figure 5: An overview of different generative models.