Multi-Stage Contrastive Regression for Action Quality Assessment

Qi An, Mengshi Qi, Huadong Ma

TL;DR

The paper tackles video action quality assessment (AQA) by exploiting the stage-level structure of actions through multi-stage segmentation. It introduces Multi-stage Contrastive Regression (MCoRe), a three-part pipeline consisting of a feature extractor, a procedure segmentation module, and a regressor. The regressor predicts a score relative to an exemplar video, so the final score is $\hat{S_q} = S_e + \Upsilon(V_q;V_e)$, where $V_q$ is the query video, $V_e$ is the exemplar, and $S_e$ is the exemplar's known score. A stage-wise contrastive loss $\mathcal{L}_{cont}$ enforces consistency across corresponding stages and discourages cross-stage confusion, improving both segmentation and scoring. On the FineDiving dataset, MCoRe delivers state-of-the-art performance in SRCC and R-$\ell_2$ while requiring significantly fewer FLOPs and parameters than prior methods. The approach demonstrates that stage-aligned contrastive regression is effective for fine-grained AQA and offers practical benefits for real-time or resource-constrained scenarios.
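The relative-scoring formula above can be illustrated with a minimal sketch. The regressor output $\Upsilon(V_q;V_e)$ is passed in as a plain number here, and the averaging over several exemplars in `predict_with_exemplars` is an assumption made for illustration, not something this summary confirms about the paper's inference procedure. All function names are hypothetical.

```python
from statistics import mean


def predict_score(exemplar_score: float, relative_score: float) -> float:
    """Apply S_q_hat = S_e + Upsilon(V_q; V_e).

    `relative_score` stands in for the regressor output Upsilon(V_q; V_e).
    """
    return exemplar_score + relative_score


def predict_with_exemplars(pairs) -> float:
    """Average the prediction over several (S_e, relative score) pairs.

    Assumption: averaging is one simple way to combine exemplars; the
    summary does not state how MCoRe aggregates them.
    """
    return mean(predict_score(s_e, rel) for s_e, rel in pairs)


# One exemplar scored 80.0; the regressor judges the query 4.5 points worse.
single = predict_score(80.0, -4.5)

# Three exemplars, each with its own (made-up) relative score.
multi = predict_with_exemplars([(80.0, -4.5), (70.0, 6.0), (90.0, -15.0)])
```

Because the model only has to regress a difference from a known score, an exemplar of the same action type gives it a concrete reference point rather than requiring an absolute score from scratch.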

Abstract

In recent years, there has been growing interest in video-based action quality assessment (AQA). Most existing methods solve the AQA problem by considering the entire video, overlooking the inherent stage-level characteristics of actions. To address this issue, we design a novel Multi-stage Contrastive Regression (MCoRe) framework for the AQA task. By segmenting the input video into multiple stages or procedures, this approach extracts spatial-temporal information efficiently while reducing computational cost. Inspired by graph contrastive learning, we propose a new stage-wise contrastive learning loss function to enhance performance. As a result, MCoRe achieves state-of-the-art results on the widely adopted fine-grained AQA dataset.
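To make the idea of a stage-wise contrastive loss concrete, here is a minimal sketch of a generic InfoNCE-style loss over stage features. This is not the paper's $\mathcal{L}_{cont}$, whose exact form is not given in this summary. It assumes each video has already been reduced to one non-zero feature vector per stage, treats the same-index stage of the exemplar as the positive, and treats the exemplar's other stages as negatives. The function names and the temperature value are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def stage_contrastive_loss(query_stages, exemplar_stages, temperature=0.5):
    """Generic InfoNCE-style loss over K stage features (illustrative only).

    For query stage i, the positive is exemplar stage i and the negatives
    are the remaining exemplar stages.
    """
    k = len(query_stages)
    total = 0.0
    for i in range(k):
        logits = [
            cosine(query_stages[i], exemplar_stages[j]) / temperature
            for j in range(k)
        ]
        # Numerically stable log-sum-exp for the softmax denominator.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability assigned to the matching stage.
        total += log_denominator - logits[i]
    return total / k


query = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
misaligned = list(reversed(aligned))

loss_aligned = stage_contrastive_loss(query, aligned)
loss_misaligned = stage_contrastive_loss(query, misaligned)
```

In this toy setup the loss is lower when the exemplar's stages line up with the query's stages than when they are shuffled, which is the behavior a stage-wise contrastive objective is meant to reward.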

Paper Structure

This paper contains 12 sections, 14 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Overview of the MCoRe network. The input video to be tested is denoted as $\mathit{V_q}$, and $\mathit{V_e}$ is the exemplar selected from an existing dataset. Features are extracted separately from $\mathit{V_q}$ and $\mathit{V_e}$ and then divided into K stages. Features from the same stage of the two videos are treated as a pair and passed to a decoder to obtain $\boldsymbol{f}_{rel}$, which represents the difference between the two videos. Finally, a relative score is regressed and added to the exemplar score $\mathit{S_e}$, resulting in the predicted score $\hat{\mathit{S_q}}$.
  • Figure 2: Visualized examples of the predicted results. A test video is used as the query, and videos with the same dive type number (e.g., "407C") are randomly selected from the dataset as exemplars.
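The stage pairing described in the Figure 1 caption can be sketched roughly as follows. This is a simplified illustration under stated assumptions: per-frame features are plain lists of floats, stage boundaries are given as strictly increasing frame indices inside the video (in MCoRe they would come from the procedure segmentation module), and each stage is summarized by mean pooling, which is a choice made here for brevity rather than the paper's decoder. All function names are hypothetical.

```python
def split_into_stages(frame_features, boundaries):
    """Split per-frame features into len(boundaries) + 1 consecutive stages.

    Assumes boundaries are strictly increasing and lie strictly inside
    the frame range, so no stage is empty.
    """
    edges = [0, *boundaries, len(frame_features)]
    return [frame_features[a:b] for a, b in zip(edges, edges[1:])]


def mean_pool(stage):
    """Average the feature vectors of one stage (illustrative pooling)."""
    count = len(stage)
    return [sum(column) / count for column in zip(*stage)]


def pair_stages(query_feats, query_bounds, exemplar_feats, exemplar_bounds):
    """Return one (query stage feature, exemplar stage feature) pair per stage."""
    query = [mean_pool(s) for s in split_into_stages(query_feats, query_bounds)]
    exemplar = [
        mean_pool(s) for s in split_into_stages(exemplar_feats, exemplar_bounds)
    ]
    return list(zip(query, exemplar))


# Toy 2-D features: 6 query frames and 8 exemplar frames, K = 3 stages each.
query_feats = [[float(t), 1.0] for t in range(6)]
exemplar_feats = [[float(t), 2.0] for t in range(8)]

pairs = pair_stages(query_feats, [2, 4], exemplar_feats, [3, 6])
```

Each pair would then feed a component that produces the relative feature $\boldsymbol{f}_{rel}$. Note that the two videos may have different lengths and different boundaries, yet still yield the same number of stage pairs.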