A Comprehensive Survey of Action Quality Assessment: Method and Benchmark

Kanglei Zhou, Ruizhi Cai, Liyuan Wang, Hubert P. H. Shum, Xiaohui Liang

TL;DR

This survey tackles fragmentation in Action Quality Assessment (AQA) by introducing a hierarchical taxonomy based on input modalities and by constructing a unified, open benchmark that integrates six datasets and seven evaluation metrics to compare both assessment accuracy and computational efficiency. It analyzes over 150 papers to reveal how video-based, skeleton-based, and multi-modal approaches interrelate, and it discusses emerging trends, challenges, and future directions. The authors also highlight three task-specific AQA applications (semi-supervised, continual, and interpretable AQA), along with practical insights for cross-domain generalization and deployment. Overall, the work provides a standardized foundation for evaluating AQA methods, guiding future research toward robust, scalable, and interpretable action quality assessment in diverse real-world contexts.

Abstract

Action Quality Assessment (AQA) quantitatively evaluates the quality of human actions, providing automated assessments that reduce biases in human judgment. Its applications span domains such as sports analysis, skill assessment, and medical care. Recent advances in AQA have introduced innovative methodologies, but similar methods often intertwine across different domains, highlighting the fragmented nature that hinders systematic reviews. In addition, the lack of a unified benchmark and limited computational comparisons hinder consistent evaluation and fair assessment of AQA approaches. In this work, we address these gaps by systematically analyzing over 150 AQA-related papers to develop a hierarchical taxonomy, construct a unified benchmark, and provide an in-depth analysis of current trends, challenges, and future directions. Our hierarchical taxonomy categorizes AQA methods based on input modalities (video, skeleton, multi-modal) and their specific characteristics, highlighting the evolution and interrelations across various approaches. To promote standardization, we present a unified benchmark, integrating diverse datasets to evaluate the assessment precision and computational efficiency. Finally, we review emerging task-specific applications and identify under-explored challenges in AQA, providing actionable insights into future research directions. This survey aims to deepen understanding of AQA progress, facilitate method comparison, and guide future innovations. The project web page can be found at https://ZhouKanglei.github.io/AQA-Survey.

Paper Structure

This paper contains 43 sections, 5 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Annual statistics of AQA papers in CV and ML conferences or journals, categorized by (a) AQA application domains and (b) emerging research directions in AQA that have often been overlooked in previous surveys. The notable rise in publications and the evolving trends in recent years highlight the need for a comprehensive survey.
  • Figure 2: The overall structure of our comprehensive survey. Our survey presents three core contributions: a hierarchical taxonomy that systematically organizes intertwined papers with in-depth analysis, a unified benchmark to ensure consistent evaluation and fair comparison, and an exploration of task-specific applications in AQA beyond the common setup, under-explored challenges, and prospects.
  • Figure 3: Illustration of the common AQA framework, consisting of three core components: (1) the backbone for feature extraction, which processes diverse input data modalities such as video, skeleton, or sensor data; (2) the network neck for representation learning, responsible for deriving meaningful embeddings from the extracted features; and (3) the regression or classification head, which generates outputs as continuous scores (Case 1), discrete grades (Case 2), or ranks (Case 3).
  • Figure 4: Three typical fine-grained reasoning approaches in AQA. (a) This spatial reasoning method employs an off-the-shelf pose or object detector to generate object-centric masks, which are integrated into the intermediate layers of backbone networks, as seen in [wang2021tsa]. (b) This approach leverages a Siamese backbone that processes raw video alongside object- or pose-masked/centric data, using a contrastive learning module to direct attention toward object-centric regions [nagai2021action, gedamu2023fine, gedamu2024self]. Notably, certain methods [zeng2020hybrid, chen2024long] opt for unshared parameters between the branches to facilitate the extraction of rich and hybrid features. (c) This fine-grained temporal-aware method typically incorporates a temporal parsing module to derive subactions, either implicitly or explicitly [xu2022finediving, xu2024fineparser, zhou2023hierarchical, an2024multi].
  • Figure 5: The procedural nature of actions in fine temporal modeling for AQA. (a) illustrates a diving example from the FineDiving dataset [xu2022finediving], while (b) depicts a suturing example from the JIGSAWS dataset [gao2014jhu].
  • ...and 3 more figures
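
The three-part framework described in the Figure 3 caption (backbone, neck, head) can be illustrated with a toy sketch. This is not the survey's implementation or any published AQA model: the function names, the per-frame statistics standing in for a learned backbone, the average pooling standing in for the neck, and the fixed weights and thresholds in the heads are all assumptions chosen only to show how one embedding can feed a continuous score (Case 1), a discrete grade (Case 2), or a rank (Case 3).

```python
from typing import List, Sequence

Frame = Sequence[float]  # hypothetical stand-in for one frame of video, skeleton, or sensor data


def backbone(clip: Sequence[Frame]) -> List[List[float]]:
    """Placeholder feature extractor: per-frame [mean, max] statistics.

    A real backbone would be a learned video or skeleton network.
    """
    return [[sum(frame) / len(frame), max(frame)] for frame in clip]


def neck(features: Sequence[Sequence[float]]) -> List[float]:
    """Placeholder representation learning: average the features over time."""
    n_frames = len(features)
    dim = len(features[0])
    return [sum(f[d] for f in features) / n_frames for d in range(dim)]


def score_head(embedding: Sequence[float],
               weights: Sequence[float] = (0.5, 0.5),
               bias: float = 0.0) -> float:
    """Case 1: continuous score from a fixed (untrained) linear map."""
    return sum(w * e for w, e in zip(weights, embedding)) + bias


def grade_head(embedding: Sequence[float],
               thresholds: Sequence[float] = (0.33, 0.66)) -> int:
    """Case 2: discrete grade, here the number of thresholds the score reaches."""
    score = score_head(embedding)
    return sum(score >= t for t in thresholds)


def rank_head(embeddings: Sequence[Sequence[float]]) -> List[int]:
    """Case 3: rank per sample (1 = best) by sorting scores in descending order."""
    scores = [score_head(e) for e in embeddings]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Two toy clips, each with two frames of two values.
clip_a = [[0.0, 1.0], [1.0, 1.0]]
clip_b = [[0.0, 0.0], [0.0, 0.5]]

emb_a = neck(backbone(clip_a))
emb_b = neck(backbone(clip_b))

print(score_head(emb_a), grade_head(emb_a))
print(score_head(emb_b), grade_head(emb_b))
print(rank_head([emb_a, emb_b]))
```

Keeping the head separate from the backbone and neck mirrors the caption's point that the same learned representation can be paired with different output formulations. In practice each stage would be a trained module rather than the fixed arithmetic used here.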