Human Motion Generation: A Survey
Wentao Zhu, Xiaoxuan Ma, Dongwoo Ro, Hai Ci, Jinlu Zhang, Jiaxin Shi, Feng Gao, Qi Tian, Yizhou Wang
TL;DR
This survey comprehensively maps the landscape of human motion generation, focusing on conditional generation driven by text, audio, and scene context. It delineates motion data representations, collection methods, and a taxonomy of generation techniques, including regression-based and deep generative approaches (GANs, VAEs, normalizing flows, diffusion) with examples across action-to-motion, text-to-motion, music-to-dance, speech-to-gesture, and scene-conditioned generation. It also catalogs widely used datasets and evaluation metrics, highlighting the challenges of evaluating naturalness, diversity, and condition-consistency, and discusses open problems such as data quality, semantic grounding, and controllability. The paper concludes with future directions toward data fusion, richer semantic modeling, principled evaluation, and interactive, multi-user motion generation systems.
Abstract
Human motion generation aims to generate natural human pose sequences and shows immense potential for real-world applications. Substantial progress has been made recently in motion data collection technologies and generation methods, laying the foundation for increasing interest in human motion generation. Most research within this field focuses on generating human motions based on conditional signals, such as text, audio, and scene contexts. While significant advancements have been made in recent years, the task continues to pose challenges due to the intricate nature of human motion and its implicit relationship with conditional signals. In this survey, we present a comprehensive literature review of human motion generation, which, to the best of our knowledge, is the first of its kind in this field. We begin by introducing the background of human motion and generative models, followed by an examination of representative methods for three mainstream sub-tasks: text-conditioned, audio-conditioned, and scene-conditioned human motion generation. Additionally, we provide an overview of common datasets and evaluation metrics. Lastly, we discuss open problems and outline potential future research directions. We hope that this survey could provide the community with a comprehensive glimpse of this rapidly evolving field and inspire novel ideas that address the outstanding challenges.
