Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos
Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram
TL;DR
The paper addresses the difficulty of evaluating complex human actions in synthesized videos by learning a latent action manifold that fuses appearance-agnostic skeletal geometry with appearance cues. It introduces TAG-Bench, a benchmark focused on action correctness and temporal coherence, and two metrics, S_cons and S_temp, derived from the manifold embeddings. The approach combines SMPL-based 3D features, 2D keypoints, and ViT-based appearance features with first-order temporal coherence, and uses contrastive training with hard negatives to shape a robust action space. Empirically, the metrics correlate more strongly with human judgments than the baselines do, improve on existing state-of-the-art methods on TAG-Bench by more than 68%, and perform competitively on established external benchmarks.
Abstract
Despite rapid advances in video generative models, robust metrics for evaluating the visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased and lack temporal understanding, so they struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves a substantial improvement of more than 68% over existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and correlates more strongly with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.
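To make the idea of scoring a generated video by its distance to a learned real-world action distribution more concrete, the following is a minimal sketch and not the paper's actual metric. It assumes that embeddings have already been produced by some encoder, that the real-action distribution can be summarized by the centroid of real embeddings for one action class, and that temporal smoothness can be approximated by first-order differences between consecutive frame embeddings. The function names `distance_to_real_actions` and `temporal_roughness` are hypothetical and do not correspond to S_cons or S_temp as defined by the authors.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must have non-zero norm")
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def distance_to_real_actions(video_embedding, real_embeddings):
    """Assumption: the real-action distribution is summarized by a centroid.

    Lower values mean the generated video sits closer to real examples
    of the same action in the embedding space.
    """
    centroid = mean_vector(real_embeddings)
    return cosine_distance(video_embedding, centroid)


def temporal_roughness(frame_embeddings):
    """Assumption: smoothness is proxied by first-order differences.

    Returns the mean Euclidean distance between consecutive frame
    embeddings; abrupt jumps in the sequence raise this value.
    """
    if len(frame_embeddings) < 2:
        raise ValueError("need at least two frames")
    steps = [
        math.dist(current, following)
        for current, following in zip(frame_embeddings, frame_embeddings[1:])
    ]
    return sum(steps) / len(steps)
```

In this sketch a generated clip would be judged more plausible when both values are small. A learned metric would replace the centroid and the raw differences with quantities computed in the trained latent space.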
