Pose-Aware Multi-Level Motion Parsing for Action Quality Assessment

Shuaikang Zhu, Yang Yang, Chen Sun

TL;DR

This work addresses fine-grained action quality assessment by leveraging pose information through a pose-aware, multi-level motion parsing framework. It introduces four interconnected components—Action-Unit Parser, Motion Parser, Condition Parser, and Weight-Adjust Scoring Module—to segment actions, extract rich pose/appearance/condition features, and compute weighted score differences between query and reference videos. Extensive experiments on FineDiving, FineDiving-HM, and MTL-AQA demonstrate state-of-the-art performance in both action segmentation and scoring, with ablations confirming the value of each module and the pose-centric design. The approach offers a flexible, interpretable mechanism that can adapt to varying scoring rules and action types, enabling robust, fine-grained evaluation in competitive sports and related applications.

Abstract

Human pose serves as a cornerstone of action quality assessment (AQA), where subtle spatial-temporal variations in pose often distinguish excellence from mediocrity. In high-level competitions, these nuanced differences become decisive factors in scoring. In this paper, we propose a novel multi-level motion parsing framework for AQA based on enhanced spatial-temporal pose features. At the first level, the Action-Unit Parser uses pose extraction to achieve precise action segmentation and comprehensive local-global pose representations. At the second level, the Motion Parser applies spatial-temporal feature learning to capture pose changes and appearance details for each action-unit. Conditions unrelated to the body, such as water splash in diving, also affect action scoring, so we design an additional Condition Parser that gives users more flexibility in their choices. Finally, a Weight-Adjust Scoring Module is introduced to better accommodate the diverse requirements of various action types and the multi-scale nature of action-units. Extensive evaluations on large-scale diving sports datasets demonstrate that our multi-level motion parsing framework achieves state-of-the-art performance in both action segmentation and action scoring tasks.
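
To make the final scoring step more concrete, here is a minimal sketch of one way a weighted score difference between a query video and a reference video could be turned into a query score. It is an illustration only, not the paper's implementation: the softmax weighting, the function names (`softmax`, `predict_query_score`), the per-action-unit score differences, and all numbers are assumptions introduced here. In the framework itself, the differences and weights would come from learned modules.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn real-valued logits into positive weights that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_query_score(
    reference_score: float,
    unit_score_diffs: Sequence[float],
    unit_weight_logits: Sequence[float],
) -> float:
    """Hypothetical scoring rule: reference score plus a weighted sum of
    per-action-unit score differences (query minus reference)."""
    if len(unit_score_diffs) != len(unit_weight_logits):
        raise ValueError("need exactly one weight logit per action-unit")
    weights = softmax(unit_weight_logits)
    weighted_diff = sum(w * d for w, d in zip(weights, unit_score_diffs))
    return reference_score + weighted_diff


# Made-up example: three action-units, equally weighted.
score = predict_query_score(
    reference_score=80.0,
    unit_score_diffs=[1.0, -2.0, 3.0],
    unit_weight_logits=[0.0, 0.0, 0.0],
)
print(round(score, 3))
```

Normalizing the weights keeps the contribution of each action-unit comparable regardless of how many units a video is split into, which is one plausible reading of adapting to "the multi-scale nature of action-units"; the paper may use a different mechanism.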

Paper Structure

This paper contains 20 sections, 15 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Comparison example of pose differences and action scoring discrepancies.
  • Figure 2: Structure of the proposed Multi-Level Motion Parsing framework. We utilize a multi-level parser to separate foreground athletes from the input video pair, extract pose information, and generate action-units. These parsers then yield appearance features, pose features, and condition features. By comparing the differences in these features, we ultimately regress the score difference.
  • Figure 3: The visualization of the composition of the action-unit image.
  • Figure 4: The network structure of the Action Segmentation Module.
  • Figure 5: The network structure of the Pure-Pose Feature Extractor.
  • ...and 3 more figures