Action Quality Assessment via Hierarchical Pose-guided Multi-stage Contrastive Regression

Mengshi Qi; Hao Ye; Jiaxuan Peng; Huadong Ma

TL;DR

The paper tackles Action Quality Assessment (AQA), targeting two challenges: fine-grained differences between sub-actions and temporal segmentation of those sub-actions. It presents HP-MCoRe, a hierarchical pose-guided multi-stage contrastive regression framework that fuses static visual, dynamic visual, and hierarchical skeletal cues through procedure segmentation and stage-wise learning, with additional splash-aware supervision. A new FineDiving-Pose dataset with manual and automatic pose annotations improves pose label quality for AQA research. Results on FineDiving and MTL-AQA show state-of-the-art performance, supported by ablation studies. Code and dataset are publicly available.

Abstract

Action Quality Assessment (AQA), which aims at automatic and fair evaluation of athletic performance, has gained increasing attention in recent years. However, athletes often move rapidly and the corresponding variations in visual appearance are subtle, making it challenging to capture fine-grained pose differences and leading to poor estimation performance. Furthermore, most common AQA tasks, such as diving, are divided into multiple sub-actions of different durations. Existing methods, however, segment the video into fixed frames, which disrupts the temporal continuity of sub-actions and results in unavoidable prediction errors. To address these challenges, we propose a novel action quality assessment method based on hierarchical pose-guided multi-stage contrastive regression. First, we introduce a multi-scale dynamic visual-skeleton encoder to capture fine-grained spatio-temporal visual and skeletal features. Then, a procedure segmentation network separates the different sub-actions and produces segmented features. Next, the segmented visual and skeletal features are both fed into a multi-modal fusion module as physical structural priors, guiding the model to learn refined activity similarities and variances. Finally, a multi-stage contrastive learning regression approach is employed to learn discriminative representations and output prediction results. In addition, we introduce a newly annotated FineDiving-Pose dataset to improve on the current low-quality human pose labels. Experimental results on the FineDiving and MTL-AQA datasets demonstrate the effectiveness and superiority of our proposed approach. Our source code and dataset are available at https://github.com/Lumos0507/HP-MCoRe.
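To make the segmentation issue described in the abstract concrete, the following minimal sketch contrasts fixed-length chunking with splitting at sub-action boundaries. It is an illustration only, not the paper's procedure segmentation network: the function names, the 12-frame clip, and the boundary indices `[2, 9]` are all hypothetical, and in the actual method the boundaries would be predicted by a learned module.

```python
from typing import List, Sequence


def split_fixed(frames: Sequence, num_chunks: int) -> List[list]:
    """Split frames into equal-length chunks, ignoring sub-action boundaries.

    Frames left over when the length is not divisible are dropped.
    """
    size = len(frames) // num_chunks
    return [list(frames[i * size:(i + 1) * size]) for i in range(num_chunks)]


def split_by_boundaries(frames: Sequence, boundaries: Sequence[int]) -> List[list]:
    """Split frames at transition indices, so each stage keeps its own length."""
    edges = [0, *boundaries, len(frames)]
    if any(a >= b for a, b in zip(edges, edges[1:])):
        raise ValueError("boundaries must be strictly increasing and inside the clip")
    return [list(frames[a:b]) for a, b in zip(edges, edges[1:])]


# Hypothetical 12-frame dive; each label names the sub-action of that frame.
frames = ["takeoff"] * 2 + ["flight"] * 7 + ["entry"] * 3

fixed = split_fixed(frames, 3)                 # three chunks of 4 frames each
staged = split_by_boundaries(frames, [2, 9])   # stages of 2, 7, and 3 frames
```

In this toy clip the first fixed chunk mixes takeoff and flight frames, while each boundary-based stage contains a single sub-action, which is the continuity property the abstract argues fixed segmentation loses.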

Paper Structure

This paper contains 19 sections, 34 equations, 7 figures, 6 tables.

Introduction
RELATED WORK
PROPOSED APPROACH
Overview
Multi-Scale Visual-Skeletal Encoder
Procedure Segmentation Module
Multi-Modal Fusion Module
Multi-stage Contrastive Regression Module
Optimization and Inference
FINEDIVING-POSE DATASET
Data Source
Annotation Pipeline
Visualization Analysis of Annotated Data
EXPERIMENT
Experiment Settings
...and 4 more sections

Figures (7)

Figure 1: Illustration of our proposed framework for AQA. The framework integrates both human visual and hierarchical skeletal information to capture fine-grained features and physical priors for high-quality action assessment. Additionally, we introduce a procedure segmentation module that dynamically models the sub-action sequences. Subsequently, we fuse the skeletal and visual features to derive spatiotemporal features. Finally, we propose a contrastive learning-based regression approach to enhance the evaluation accuracy.
Figure 2: Overview of our proposed hierarchical pose-guided multi-stage action quality assessment framework. Encoder: The framework takes query pairs $\{V^q, P^q\}$ as input for testing, while exemplar pairs $\{V^e, P^e\}$ are selected from an existing dataset. The dynamic visual encoder and the (c) hierarchical skeletal encoder capture the spatiotemporal visual and pose features $F_{dy}$ and $F_{sk}$. The static visual encoder captures static human features $F_{st}$. The (d) procedure segmentation network segments these features into $\mathbb{K}$ stages, resulting in $F_{dy}^k$, $F_{sk}^k$ and $F_{st}^k$. The (e) stage-wise contrastive loss is applied to enhance segmentation accuracy. The multi-stage features $F_{dy}^k$ and $F_{sk}^k$ are input into the multi-modal fusion module to obtain the fused features $F_{fu}^k$. Regressor: The inputs are the fused dynamic features and static features of the $k$-th segment, denoted as $F_{fu}^{(q,k)}$, $F_{st}^{(q,k)}$, $F_{fu}^{(e,k)}$ and $F_{st}^{(e,k)}$. The static features and fused features are fed into the stage contrastive regression module separately to obtain the relative scores $S_{fu}$ and $S_{st}$. Finally, a confidence value $\lambda$ is applied, and the exemplar score $S_e$ is added to produce the predicted score $\hat{S}^q$.
Figure 3: The illustration of the hierarchical pose encoding. We decompose the human pose into three levels using a graph convolutional network. The first level focuses on the torso, capturing the correctness of the torso rotation. The second level focuses on the inner parts of the limbs, capturing transitions between actions. The third level targets the outer parts of the limbs, capturing the coordination of the athlete's body movements.
Figure 4: Illustration of the Multi-modal Fusion Module. We introduce pose-guided attention into this module, where the input pose and visual features share the same procedure segmentation information, ensuring consistency in their corresponding spatiotemporal information.
Figure 5: Visualization results of the FineDiving-Pose dataset. The left panel presents detailed 2D skeletal annotations covering the complete action sequence, including takeoff, flight, and entry. The right panel presents a comparative visualization of actions '407C' and '307C' using the original HRNet and our proposed method. Compared with the standard HRNet approach, our method demonstrates improved accuracy in handling extreme body postures.
...and 2 more figures
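
The final step in the Figure 2 caption (relative scores $S_{fu}$ and $S_{st}$, a confidence value $\lambda$, and the exemplar score $S_e$) can be sketched as below. The caption does not state the exact formula, so the convex combination $\hat{S}^q = S_e + \lambda S_{fu} + (1 - \lambda) S_{st}$ used here is an assumption made for illustration, and the function name and numeric values are hypothetical.

```python
def predict_score(exemplar_score: float, rel_fused: float,
                  rel_static: float, lam: float) -> float:
    """Combine two relative scores and add the exemplar score.

    Assumed form (not confirmed by the source):
        predicted = S_e + lam * S_fu + (1 - lam) * S_st
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    # Weight the fused-branch and static-branch relative scores by confidence.
    relative = lam * rel_fused + (1.0 - lam) * rel_static
    # Contrastive regression predicts a difference, so the exemplar's known
    # score is added back to obtain an absolute score for the query.
    return exemplar_score + relative


# Hypothetical values: exemplar scored 80.0, branches predict +4.0 and +2.0.
pred = predict_score(exemplar_score=80.0, rel_fused=4.0, rel_static=2.0, lam=0.5)
```

Under this assumed form, setting `lam` to 1.0 relies only on the fused visual-skeletal branch, and 0.0 relies only on the static branch.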
