Establishing a Unified Evaluation Framework for Human Motion Generation: A Comparative Analysis of Metrics
Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier
TL;DR
The paper addresses the lack of a unified, quantitative framework for evaluating human motion generation by surveying fidelity and diversity metrics and proposing a common evaluation setup. It introduces Warping Path Diversity (WPD), a metric based on Dynamic Time Warping (DTW) that captures diversity in temporal distortion across generated sequences, and demonstrates the evaluation pipeline on three conditional variational autoencoders (CVAEs) trained on the HumanAct12 dataset. Key contributions include a review of eight existing metrics (FID, AOG, density, precision, APD, ACPD, coverage, MMS), a formal definition of WPD derived from DTW warping paths, and experiments showing that the choice of model depends on the target application rather than on any single metric. The authors release their code publicly to support reproducible, multi-metric comparisons and to give newcomers a practical starting point for standardized evaluation of human motion generation.
Abstract
The development of generative artificial intelligence for human motion generation has expanded rapidly, necessitating a unified evaluation framework. This paper presents a detailed review of eight evaluation metrics for human motion generation, highlighting their unique features and shortcomings. We propose standardized practices through a unified evaluation setup to facilitate consistent model comparisons. Additionally, we introduce a novel metric that assesses diversity in temporal distortion by analyzing warping diversity, thereby enhancing the evaluation of temporal data. We also conduct experimental analyses of three generative models using a publicly available dataset, offering insights into the interpretation of each metric in specific case scenarios. Our goal is to offer a clear, user-friendly evaluation framework for newcomers, complemented by publicly accessible code.
