GAIA: Rethinking Action Quality Assessment for AI-Generated Videos

Zijian Chen, Wei Sun, Yuan Tian, Jun Jia, Zicheng Zhang, Jiarui Wang, Ru Huang, Xiongkuo Min, Guangtao Zhai, Wenjun Zhang

TL;DR

GAIA, a Generic AI-generated Action dataset, is constructed through a large-scale subjective evaluation from a novel causal reasoning-based perspective, yielding 971,244 ratings across 9,180 video-action pairs. A suite of popular text-to-video models is then evaluated on its ability to generate visually rational actions.

Abstract

Assessing action quality is both imperative and challenging: it has a significant impact on the quality of AI-generated videos (AIGVs), and the task is further complicated by the inherently ambiguous nature of actions within AIGVs. Current action quality assessment (AQA) algorithms predominantly focus on actions from specific real-world scenarios and are pre-trained with normative action features, which renders them inapplicable to AIGVs. To address these problems, we construct GAIA, a Generic AI-generated Action dataset, by conducting a large-scale subjective evaluation from a novel causal reasoning-based perspective, resulting in 971,244 ratings among 9,180 video-action pairs. Based on GAIA, we evaluate a suite of popular text-to-video (T2V) models on their ability to generate visually rational actions, revealing their pros and cons on different categories of actions. We also extend GAIA as a testbed to benchmark the AQA capacity of existing automatic evaluation methods. Results show that traditional AQA methods, action-related metrics in recent T2V benchmarks, and mainstream video quality methods perform poorly, with average Spearman rank correlation coefficients (SRCC) of 0.454, 0.191, and 0.519, respectively, indicating a sizable gap between current models and human action perception patterns in AIGVs. Our findings underscore the significance of action quality as a unique perspective for studying AIGVs and can catalyze progress towards methods with enhanced capacities for AQA in AIGVs.
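
The SRCC figures quoted in the abstract measure how well a metric's ranking of videos agrees with the ranking implied by human scores. The sketch below is a minimal illustration of that computation, not the authors' evaluation code: it ranks both score lists (ties share their average rank) and takes the Pearson correlation of the ranks. The function names `average_ranks` and `srcc` and the five toy scores are assumptions made for this example and do not come from GAIA.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        tied_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = tied_rank
        start = end + 1
    return ranks


def srcc(predicted, mos):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(mos) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(mos)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for five videos (illustrative only, not GAIA data).
toy_mos = [2.1, 3.4, 4.0, 1.5, 3.9]
toy_pred = [0.30, 0.55, 0.70, 0.20, 0.80]
print(round(srcc(toy_pred, toy_mos), 3))  # → 0.9
```

Because only ranks matter, a metric can score well on SRCC even when its raw outputs are on a different scale from the human scores; an SRCC near 0.2 means its ordering of videos barely tracks the human ordering.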

Paper Structure

This paper contains 26 sections, 5 equations, 15 figures, and 15 tables.

Figures (15)

  • Figure 1: Data construction pipeline and content overview of GAIA. (a) Curation process of the GAIA dataset, resulting in 9,180 videos with 971,244 human ratings. (b) The distribution of unique actions per class. (c) 3D scatter plot of the mean opinion score (MOS) in three dimensions and video examples with diverged scores.
  • Figure 2: SRCC between MOSs as the number of observers increases.
  • Figure 3: MOS distributions across different models in terms of subject quality, action completeness, and action-scene interaction. 11 Lab studies: (a)-(k); 7 Commercial applications: (l)-(r).
  • Figure 4: Visualization of generated videos, sorted by subject quality from highest to lowest. The action keyword (relatively small (left) and large (right) movement) is highlighted in pink.
  • Figure 5: Box plots of $\mathrm{MOS_{s}}$, $\mathrm{MOS_{c}}$, and $\mathrm{MOS_{i}}$ across action categories. (a), (b), and (c) show whole-body actions. (d) and (f) show hand and facial actions. For each box, the central mark is the median, the edges of the box represent the 25th and 75th percentiles, and red circles denote outliers.
  • ...and 10 more figures