Pose-Aware Multi-Level Motion Parsing for Action Quality Assessment
Shuaikang Zhu, Yang Yang, Chen Sun
TL;DR
This work addresses fine-grained action quality assessment by leveraging pose information through a pose-aware, multi-level motion parsing framework. It introduces four interconnected components—Action-Unit Parser, Motion Parser, Condition Parser, and Weight-Adjust Scoring Module—to segment actions, extract rich pose/appearance/condition features, and compute weighted score differences between query and reference videos. Extensive experiments on FineDiving, FineDiving-HM, and MTL-AQA demonstrate state-of-the-art performance in both action segmentation and scoring, with ablations confirming the value of each module and the pose-centric design. The approach offers a flexible, interpretable mechanism that can adapt to varying scoring rules and action types, enabling robust, fine-grained evaluation in competitive sports and related applications.
Abstract
Human pose is a cornerstone of action quality assessment (AQA), where subtle spatial-temporal variations in pose often distinguish excellence from mediocrity. In high-level competitions, these nuanced differences become decisive factors in scoring. In this paper, we propose a novel multi-level motion parsing framework for AQA based on enhanced spatial-temporal pose features. At the first level, the Action-Unit Parser uses pose extraction to achieve precise action segmentation and comprehensive local-global pose representations. At the second level, the Motion Parser applies spatial-temporal feature learning to capture pose changes and appearance details for each action-unit. Conditions unrelated to the body, such as water splash in diving, also affect action scoring, so we design an additional Condition Parser that offers users more flexibility in handling them. Finally, a Weight-Adjust Scoring Module is introduced to better accommodate the diverse requirements of various action types and the multi-scale nature of action-units. Extensive evaluations on large-scale diving datasets demonstrate that our multi-level motion parsing framework achieves state-of-the-art performance in both action segmentation and action scoring tasks.
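As a rough illustration of the weighted scoring idea summarized above, the sketch below combines per-action-unit score differences between a query and a reference video into a single predicted score. It is not the paper's implementation: the function names, the softmax normalization of the weights, and the additive form (reference score plus weighted difference) are all assumptions made for illustration only.

```python
from math import exp
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """Normalize raw weights so they are positive and sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_relative_score(
    reference_score: float,
    unit_differences: Sequence[float],
    unit_logits: Sequence[float],
) -> float:
    """Hypothetical weighted scoring: reference score plus a weighted
    sum of per-action-unit score differences (query minus reference).

    ASSUMPTION: one difference and one raw weight per action-unit;
    in a learned model both would come from network outputs.
    """
    if len(unit_differences) != len(unit_logits):
        raise ValueError("need one weight per action-unit difference")
    if not unit_differences:
        raise ValueError("at least one action-unit is required")
    weights = softmax(unit_logits)
    delta = sum(w * d for w, d in zip(weights, unit_differences))
    return reference_score + delta
```

For example, calling `weighted_relative_score(80.0, [3.0, -1.5, 0.0], [0.0, 0.0, 0.0])` weights three action-units equally, so the query score is the reference score shifted by the mean of the three differences. Raising one unit's raw weight makes its difference count for more, which is one simple way to reflect scoring rules that emphasize particular phases of an action.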
