Generative Action Tell-Tales: Assessing Human Motion in Synthesized Videos

Xavier Thomas, Youngsun Lim, Ananya Srinivasan, Audrey Zheng, Deepti Ghadiyaram

TL;DR

The paper tackles the difficulty of evaluating complex human actions in synthesized videos by learning a latent action manifold that fuses appearance-agnostic skeletal geometry with appearance cues. It introduces TAG-Bench, a benchmark focused on action correctness and temporal coherence, and two metrics, S_cons and S_temp, derived from the manifold embeddings. The approach uses SMPL-based 3D features, 2D keypoints, and ViT-based appearance features, combined with first-order temporal coherence and contrastive/hard-negative training to shape a robust action space. Empirically, the metrics align closely with human judgments, outperform numerous baselines on TAG-Bench (by over 68% relative), and generalize to external benchmarks, signaling a new standard for action-aware video generation evaluation.

Abstract

Despite rapid advances in video generative models, robust metrics for evaluating visual and temporal correctness of complex human actions remain elusive. Critically, existing pure-vision encoders and Multimodal Large Language Models (MLLMs) are strongly appearance-biased, lack temporal understanding, and thus struggle to discern intricate motion dynamics and anatomical implausibilities in generated videos. We tackle this gap by introducing a novel evaluation metric derived from a learned latent space of real-world human actions. Our method first captures the nuances, constraints, and temporal smoothness of real-world motion by fusing appearance-agnostic human skeletal geometry features with appearance-based features. We posit that this combined feature space provides a robust representation of action plausibility. Given a generated video, our metric quantifies its action quality by measuring the distance between its underlying representations and this learned real-world action distribution. For rigorous validation, we develop a new multi-faceted benchmark specifically designed to probe temporally challenging aspects of human action fidelity. Through extensive experiments, we show that our metric achieves substantial improvement of more than 68% compared to existing state-of-the-art methods on our benchmark, performs competitively on established external benchmarks, and has a stronger correlation with human perception. Our in-depth analysis reveals critical limitations in current video generative models and establishes a new standard for advanced research in video generation.
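
To make the idea of scoring a generated video by its distance to a learned real-world action distribution more concrete, here is a minimal sketch. It is not the paper's metric. It assumes each video has already been encoded as a single embedding vector, it summarizes the real videos of one action class by their mean (a centroid), and it uses cosine distance. The helper names `class_centroid`, `cosine_distance`, and `distance_score`, as well as the choice of cosine distance, are illustrative assumptions.

```python
import math
from typing import List, Sequence


def class_centroid(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Mean of the real-video embeddings for one action class (assumed summary)."""
    if not embeddings:
        raise ValueError("need at least one embedding")
    dim = len(embeddings[0])
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(dim)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; 0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return 1.0 - dot / (norm_a * norm_b)


def distance_score(generated: Sequence[float],
                   real_embeddings: Sequence[Sequence[float]]) -> float:
    """Hypothetical score: distance from a generated video's embedding
    to the centroid of real videos of the same action (lower is better)."""
    return cosine_distance(generated, class_centroid(real_embeddings))


# Toy 2-D embeddings, purely illustrative.
real = [[1.0, 0.0], [1.0, 0.2]]
plausible = [2.0, 0.2]    # same direction as the centroid
implausible = [0.0, 1.0]  # far from the centroid
```

With these toy values, `distance_score(plausible, real)` is smaller than `distance_score(implausible, real)`, which matches the lower-is-better convention used in the figure captions.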

Paper Structure

This paper contains 41 sections, 6 equations, 16 figures, 11 tables.

Figures (16)

  • Figure 1: What are the telltale signs of a generative action video? We answer this by learning a robust manifold based on appearance and anatomical coherence exhibited by humans performing actions across several real-world videos. This manifold serves as anchors against which we project the features of a generated video in question and assess its realism.
  • Figure 2: Architectural overview of the encoder we train to learn the real-world action manifold. We extract per-frame static human-centric and temporal motion features (panel (a); Sec. \ref{sec:human_features}) and aggregate them, yielding one embedding for each frame (panel (b); Sec. \ref{sec:model_training}). We prepend a $[\text{CLS}]$ token to the per-frame tokens and pass them as input to a 4-layer transformer encoder (panel (c); Sec. \ref{sec:model_training}). Our aim is to encourage the encoder to group diverse videos pertaining to a given action closer together. We also ensure that temporally incoherent videos lie farther apart.
  • Figure 3: Model comparisons on TAG-Bench and VBench-2.0 Human Anatomy. We compare models pairwise for the same input prompt; for each pair, the model with the higher score (human or metric) is the winner. We then plot the win ratios (see Sec. \ref{sec:vbench}) of human scores (x-axis) against win ratios from our metric (y-axis). Our metrics ($S_{\mathrm{cons}}$ and $S_{\mathrm{temp}}$) yield the same ranking of models as humans on both benchmarks.
  • Figure 4: Comparing generative models. We plot the mean $S_{\mathrm{cons}}$ and $S_{\mathrm{temp}}$ scores (Sec. \ref{sec:embedding_metrics}; lower is better) for each generative model across different actions. Wan2.2 performs best among the compared models (low scores in both $S_{\mathrm{cons}}$ and $S_{\mathrm{temp}}$). Shotput and JumpingJack challenge all models, yielding high scores on both metrics.
  • Figure 5: t-SNE visualization of the embeddings of generated videos along with train centroids. We project the $z_{\text{CLS}}$ embeddings of generated videos (colored markers) from TAG-Bench and the corresponding training class centroids (white crosses) using t-SNE. Realistic generated videos cluster near their respective class centroids (e.g., Wan2.2 videos for "PullUps", with an average human rating of $\mathbf{8.41}$ for Action Consistency), while poorly generated videos lie farther away (e.g., Wan2.2 videos for "Shotput", with an average human rating of $\mathbf{4.43}$) (see Sec. \ref{sec:human_eval}).
  • ...and 11 more figures
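
The pairwise comparison described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `win_ratios`, the nested-dictionary input layout, the `higher_is_better` flag (useful because the Figure 4 caption notes that lower metric scores are better), and the choice to skip ties are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict


def win_ratios(scores: Dict[str, Dict[str, float]],
               higher_is_better: bool = True) -> Dict[str, float]:
    """scores[prompt][model] -> score for that model's video on that prompt.

    For every prompt, each pair of models is compared once; the model with
    the better score wins. Returns model -> wins / comparisons.
    """
    wins: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for per_model in scores.values():
        for a, b in combinations(sorted(per_model), 2):
            score_a, score_b = per_model[a], per_model[b]
            if score_a == score_b:
                continue  # assumption: ties are skipped entirely
            total[a] += 1
            total[b] += 1
            a_wins = score_a > score_b if higher_is_better else score_a < score_b
            wins[a if a_wins else b] += 1
    return {model: wins[model] / total[model] for model in total}


# Toy scores for three hypothetical models on two prompts.
toy_scores = {
    "prompt_1": {"A": 8.0, "B": 5.0, "C": 3.0},
    "prompt_2": {"A": 7.0, "B": 6.0, "C": 6.5},
}
human_style = win_ratios(toy_scores)                           # higher wins
metric_style = win_ratios(toy_scores, higher_is_better=False)  # lower wins
```

Plotting the human win ratio of each model against its metric win ratio then gives the kind of scatter the caption describes, where agreement in ranking shows up as a monotone trend.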