Establishing a Unified Evaluation Framework for Human Motion Generation: A Comparative Analysis of Metrics

Ali Ismail-Fawaz, Maxime Devanne, Stefano Berretti, Jonathan Weber, Germain Forestier

TL;DR

The paper tackles the lack of a unified, quantitative framework for evaluating human motion generation by surveying fidelity and diversity metrics and proposing a unified evaluation setup. It introduces Warping Path Diversity (WPD), a DTW-based metric to capture temporal distortions in sequences, and validates a cohesive evaluation pipeline using three CVAEs trained on the HumanAct12 dataset. Key contributions include a comprehensive taxonomy of metrics (FID, AOG, density, precision, APD, ACPD, coverage, MMS), a formal definition of WPD with a DTW-based derivation, and empirical insights showing that model selection depends on the target application rather than a single metric. The work provides publicly accessible code to facilitate reproducible, multi-metric comparisons and aims to equip newcomers with a practical starting point for evaluating human motion generation in a standardized way.
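
To make the DTW-based idea behind WPD more concrete, here is a minimal sketch of how a warping path between two sequences can be computed and how its departure from the diagonal might be scored. This summary does not reproduce the paper's WPD formula, so `diagonal_deviation` is an illustrative stand-in rather than the paper's definition. The function names and the squared-difference local cost are likewise assumptions.

```python
def dtw_path(x, y):
    """Classic DTW on two 1-D sequences; returns (total_cost, warping_path).

    The path is a list of 0-indexed (i, j) pairs aligning x[i] with y[j].
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2  # assumed local cost
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end to the start, always moving to the cheapest predecessor.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # step in x only
            (cost[i][j - 1], i, j - 1),          # step in y only
        ]
        _, i, j = min(candidates)
        path.append((i - 1, j - 1))
    path.reverse()
    return cost[n][m], path


def diagonal_deviation(path):
    """Illustrative score (an assumption): mean |i - j| along the warping path.

    It is 0 when the path follows the diagonal, i.e. no temporal distortion.
    """
    return sum(abs(i - j) for i, j in path) / len(path)
```

Averaging such a score over many pairs of generated sequences would give one number describing how varied the temporal distortions are, which is the intuition the TL;DR attributes to WPD.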

Abstract

The development of generative artificial intelligence for human motion generation has expanded rapidly, necessitating a unified evaluation framework. This paper presents a detailed review of eight evaluation metrics for human motion generation, highlighting their unique features and shortcomings. We propose standardized practices through a unified evaluation setup to facilitate consistent model comparisons. Additionally, we introduce a novel metric that assesses diversity in temporal distortion by analyzing warping diversity, thereby enhancing the evaluation of temporal data. We also conduct experimental analyses of three generative models using a publicly available dataset, offering insights into the interpretation of each metric in specific case scenarios. Our goal is to offer a clear, user-friendly evaluation framework for newcomers, complemented by publicly accessible code.

Paper Structure (48 sections, 2 theorems, 26 equations, 14 figures, 1 table, 1 algorithm)

Key Result

Theorem 1

A generative model $Gen_1$ is considered more fidelitous than another model $Gen_2$ on the $FID$ metric if $FID_{gen1} < FID_{gen2}$ while respecting the following constraint:

Figures (14)

  • Figure 1: Summary of all the evaluation metrics for human motion generation used in this work. First, the metrics are divided into two groups: fidelity metrics in blue and diversity metrics in green. Second, the metrics are divided into sub-groups indicating the criterion on which each measure is based, such as FID being a distribution-based metric.
  • Figure 2: Two steps are followed prior to calculating evaluation measures. In the first step, a model is trained on some supervised task using the real set of data (in blue). This model consists of a feature extractor encoder (in green) and a final layer for the supervised task. In the second step, the pre-trained encoder is used to extract the latent representation of the real samples (in blue) and of the generated samples (in red). The metrics are then computed over this latent representation.
  • Figure 3: On the left, an example computation of the amount of energy needed ($FID$) to transform the standard Gaussian distribution (in blue) into another Gaussian distribution with higher values of $\mu$ and $\sigma^2$ (in red). On the right, the amount of energy ($FID$) needed for this transformation increases gradually as $\mu$ and $\sigma^2$ increase.
  • Figure 4: This example showcases the computation of the $AOG$ metric for two generative models. Given a real space of data, where the samples are spread over three classes (red triangles, blue squares, and green circles), the generative model should be able to conditionally generate new samples. The condition is simply the class to which the generated sample should belong. In this example, each of the two generative models generates six samples, each conditioned on a class. The set of conditions used is $\hat{Y}$ (the ground truth). After generation, a pre-trained model classifies the generated samples. The $AOG$ metric is then the classification accuracy of the classes predicted by the pre-trained model against the ground truth. In this example, Model1 obtains the optimal $AOG$ value of $100\%$, while Model2 struggles to correctly condition its generation ($AOG$ of $50\%$).
  • Figure 5: This example showcases the computation of both the $density$ and $precision$ metrics over a synthetic dataset. On the left, we present the latent representation of the real set of data (blue points), the real outlier (red point), the generated examples around the outlier (black points), and the generated examples around non-outliers (green points). The blue circles represent the neighborhood area of each latent point of a real sample, for both outlier and non-outlier real samples. On the right, we present the original series space of the data with the same associated colors. It can be seen that the $density$ metric accounts for the outlier and does not produce a perfect measure ($1.25 \neq 1$), whereas $precision$ does not detect the outlier and produces a perfect measure of $1$. This example highlights how the $density$ metric concludes that the generated samples are not fully fidelitous to the real distribution because of the generations around the real outlier, which a generative model should avoid. The number of neighbors used for both metrics is set to $2$.
  • ...and 9 more figures
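
The metric computations described in the captions for Figures 3, 4, and 5 can be sketched in a few lines of Python. This is a minimal illustration, not the paper's released code: the function names are hypothetical and Euclidean distance is assumed. The one-dimensional Gaussian form of $FID$ is a simplification of the general formula $\lVert\mu_1-\mu_2\rVert^2 + \mathrm{Tr}(\Sigma_1+\Sigma_2-2(\Sigma_1\Sigma_2)^{1/2})$. The $density$ and $precision$ functions use the standard k-nearest-neighbor ball formulation, which is assumed here to match the paper's.

```python
import math


def fid_gaussian_1d(mu1, sigma1, mu2, sigma2):
    """FID between two 1-D Gaussians given means and standard deviations.

    In 1-D the trace term sigma1^2 + sigma2^2 - 2*sigma1*sigma2 reduces to
    (sigma1 - sigma2)^2.
    """
    return (mu1 - mu2) ** 2 + (sigma1 - sigma2) ** 2


def accuracy_on_generated(conditions, predicted):
    """AOG: fraction of generated samples that the pre-trained classifier
    assigns to the class they were conditioned on."""
    if len(conditions) != len(predicted):
        raise ValueError("conditions and predicted must have the same length")
    return sum(c == p for c, p in zip(conditions, predicted)) / len(conditions)


def kth_neighbor_radii(real, k):
    """Distance from each real latent point to its k-th nearest real neighbor
    (excluding itself); this defines the neighborhood ball of each point."""
    radii = []
    for i, r in enumerate(real):
        dists = sorted(math.dist(r, o) for j, o in enumerate(real) if j != i)
        radii.append(dists[k - 1])
    return radii


def precision_and_density(real, generated, k=2):
    """Return (precision, density) of generated latent points w.r.t. real ones.

    precision: share of generated points falling in at least one real ball.
    density:   number of real balls containing each generated point,
               averaged over generated points and divided by k.
    """
    radii = kth_neighbor_radii(real, k)
    covered, total_hits = 0, 0
    for g in generated:
        hits = sum(1 for r, rad in zip(real, radii) if math.dist(g, r) <= rad)
        total_hits += hits
        covered += 1 if hits > 0 else 0
    m = len(generated)
    return covered / m, total_hits / (k * m)
```

With a toy real set such as `[(0, 0), (1, 0), (0, 1), (10, 10)]`, a generated point placed on the isolated point `(10, 10)` still counts fully toward $precision$ but falls inside only one neighborhood ball, so $density$ drops below $1$. This is the same qualitative contrast between the two metrics as in Figure 5, although the toy numbers do not match the figure's.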

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2