Aligning Human Motion Generation with Human Perceptions

Haoru Wang, Wentao Zhu, Luyi Miao, Yishu Xu, Feng Gao, Qi Tian, Yizhou Wang

TL;DR

This work tackles the misalignment between automatic evaluation metrics and human perception in realistic human motion generation. It introduces MotionPercept, a large-scale perceptual evaluation dataset with 52,563 pairwise human preferences, and MotionCritic, a neural critic that is trained to predict human-aligned motion quality and serves as both a metric and a supervision signal. Across extensive experiments, MotionCritic outperforms existing metrics in matching human judgments, generalizes to unseen data, and can improve generation quality with lightweight fine-tuning in diffusion-based pipelines. The framework offers a practical path toward perceptually grounded evaluation and optimization of digital humans, with public code and data to foster broader adoption.

Abstract

Human motion generation is a critical task with a wide range of applications. Achieving high realism in generated motions requires naturalness, smoothness, and plausibility. Despite rapid advancements in the field, current generation methods often fall short of these goals. Furthermore, existing evaluation metrics typically rely on ground-truth-based errors, simple heuristics, or distribution distances, which do not align well with human perceptions of motion quality. In this work, we propose a data-driven approach to bridge this gap by introducing a large-scale human perceptual evaluation dataset, MotionPercept, and a human motion critic model, MotionCritic, that capture human perceptual preferences. Our critic model offers a more accurate metric for assessing motion quality and could be readily integrated into the motion generation pipeline to enhance generation quality. Extensive experiments demonstrate the effectiveness of our approach in both evaluating and improving the quality of generated human motions by aligning with human perceptions. Code and data are publicly available at https://motioncritic.github.io/.

Paper Structure (67 sections, 8 equations, 19 figures, 15 tables, 1 algorithm)

Figures (19)

  • Figure 1: Framework Overview. We collect MotionPercept, a large-scale, human-annotated dataset for motion perceptual evaluation, where human subjects select the best quality motion in multiple-choice questions. Using this dataset, we train MotionCritic to automatically judge motion quality in alignment with human perceptions, offering better quality metrics. Additionally, we show that MotionCritic can enhance existing motion generators with minimal fine-tuning.
  • Figure 2: We conduct a perceptual consensus experiment with 10 subjects on 312 multiple-choice questions, each with 6 options. (A): The distribution of the number of supporters for the most chosen option in each question. (B): Distribution of the number of options chosen by all subjects for each question. (C): Pairwise agreement ratio of all subjects.
  • Figure 3: (I) Critic model training process. We sample human motion pairs $\mathbf{x}^{(h)}, \mathbf{x}^{(l)}$ annotated with human preferences, for which the critic model produces score pairs. We use the perceptual alignment loss $L_\text{Percept}$ to learn from human perceptions. (II) Motion generation with critic model supervision. We intercept the MDM sampling process at a random timestep $t$ and perform single-step prediction. The critic model computes the score $s$ from the generated motion $\mathbf{x}_0'$, and $s$ is then used to calculate the motion critic loss $L_\text{Critic}$. A KL loss $L_\text{KL}$ is introduced between $\mathbf{x}_0'$ and the last-time generation result $\widetilde{\mathbf{x}_0}'$.
  • Figure 4: We group the HumanAct12 (Guo et al., 2020) ground-truth (GT) test set into 5 subsets and compare their quality. (A): GT-I to GT-V subsets, split by critic score from high to low. (B): Elo ratings from the user study, FID, and average critic scores of the different GT subsets.
  • Figure 5: Model performance during fine-tuning. (A): User study win rates (row vs. column) between models at different fine-tuning steps. (B): Elo ratings from the user study, FID, and average critic scores over the course of fine-tuning.
  • ...and 14 more figures
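
The critic training step in the Figure 3 (I) caption, where score pairs are produced for preference-annotated motion pairs, can be illustrated with a small sketch. The exact form of $L_\text{Percept}$ is not given in this excerpt, so the code below assumes a standard Bradley-Terry style objective, $-\log\sigma(s^{(h)} - s^{(l)})$, and assumes that $\mathbf{x}^{(h)}$ denotes the motion the annotator preferred. The function name `perceptual_alignment_loss` and its arguments are hypothetical and are not taken from the released code.

```python
import math


def perceptual_alignment_loss(scores_preferred, scores_rejected):
    """Mean pairwise preference loss over a batch of critic score pairs.

    ASSUMPTION: Bradley-Terry form, -log(sigmoid(s_h - s_l)), where s_h is
    the critic score of the human-preferred motion and s_l the score of the
    other motion in the pair. The paper's actual loss may differ.
    """
    if len(scores_preferred) != len(scores_rejected):
        raise ValueError("score lists must have the same length")
    if not scores_preferred:
        raise ValueError("need at least one score pair")

    total = 0.0
    for s_h, s_l in zip(scores_preferred, scores_rejected):
        margin = s_h - s_l
        # -log(sigmoid(margin)) == log(1 + exp(-margin)); written in a
        # numerically stable way so large |margin| does not overflow.
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total / len(scores_preferred)


if __name__ == "__main__":
    # The critic ranks the first pair correctly and the second incorrectly.
    print(perceptual_alignment_loss([1.5, -0.2], [0.3, 0.4]))
```

Under this assumed objective, the loss shrinks as the critic scores the preferred motion further above the other one, and equals $\log 2$ when the two scores are tied. In practice the scores would come from a differentiable critic network so the loss can be backpropagated; plain floats are used here only to keep the sketch self-contained.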