Multi-Stage Contrastive Regression for Action Quality Assessment
Qi An, Mengshi Qi, Huadong Ma
TL;DR
The paper tackles video action quality assessment (AQA) by exploiting the stage-level structure of actions through multi-stage segmentation. It introduces Multi-stage Contrastive Regression (MCoRe), a three-part pipeline consisting of a feature extractor, a procedure segmentation module, and a regressor. Rather than predicting an absolute score, the regressor $\Upsilon$ predicts the score difference between a query video $V_q$ and an exemplar video $V_e$ with known score $S_e$, so the final prediction is $\hat{S}_q = S_e + \Upsilon(V_q;V_e)$. A stage-wise contrastive loss $\mathcal{L}_{cont}$ enforces consistency between corresponding stages and discourages cross-stage confusion, which improves both segmentation and scoring. On the FineDiving dataset, MCoRe achieves state-of-the-art SRCC and R-$\ell_2$ results with significantly fewer FLOPs and parameters than prior methods. The results indicate that stage-aligned contrastive regression is effective for fine-grained AQA and is well suited to real-time or resource-constrained scenarios.
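To make the relative scoring rule $\hat{S}_q = S_e + \Upsilon(V_q;V_e)$ concrete, here is a minimal Python sketch. It assumes each video has already been reduced to one feature vector per stage; `toy_relative_regressor` is a hypothetical stand-in for $\Upsilon$ (the actual regressor is a learned network), and all names and numbers are illustrative only.

```python
from typing import Callable, Sequence

# Assumption: a video is represented as one feature vector per stage.
Features = Sequence[Sequence[float]]


def toy_relative_regressor(v_q: Features, v_e: Features) -> float:
    """Hypothetical stand-in for Upsilon: sums the mean per-stage
    feature difference between the query and the exemplar."""
    total = 0.0
    for stage_q, stage_e in zip(v_q, v_e):
        diff = [q - e for q, e in zip(stage_q, stage_e)]
        total += sum(diff) / len(diff)
    return total


def predict_score(
    s_e: float,
    v_q: Features,
    v_e: Features,
    upsilon: Callable[[Features, Features], float] = toy_relative_regressor,
) -> float:
    """Relative scoring: exemplar score plus the predicted score difference."""
    return s_e + upsilon(v_q, v_e)


# Illustrative values only: three stages, two feature dimensions each.
exemplar_score = 80.0
v_e = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
v_q = [[0.5, 0.5], [1.0, 1.0], [2.0, 3.0]]
predicted = predict_score(exemplar_score, v_q, v_e)
```

When the query equals the exemplar, the predicted difference is zero and the exemplar score is returned unchanged, which is the property that makes the exemplar act as an anchor for the prediction.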
Abstract
In recent years, there has been growing interest in video-based action quality assessment (AQA). Most existing methods solve the AQA problem by considering the entire video, overlooking the inherent stage-level characteristics of actions. To address this issue, we design a novel Multi-stage Contrastive Regression (MCoRe) framework for the AQA task. By segmenting the input video into multiple stages or procedures, this approach extracts spatial-temporal information efficiently while reducing computational cost. Inspired by graph contrastive learning, we propose a new stage-wise contrastive learning loss function to enhance performance. As a result, MCoRe achieves state-of-the-art results on the widely adopted fine-grained AQA dataset.
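The idea behind a stage-wise contrastive loss can be illustrated with a small sketch. The exact form of the paper's loss is not given in this summary, so the code below substitutes a generic InfoNCE-style objective as an assumption: stage $i$ of the query and stage $i$ of the exemplar form the positive pair, and all other exemplar stages act as negatives. The function names, the cosine similarity, and the `temperature` parameter are hypothetical choices made for illustration.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def stage_contrastive_loss(
    v_q: Sequence[Vector], v_e: Sequence[Vector], temperature: float = 0.1
) -> float:
    """InfoNCE-style sketch (an assumption, not the paper's exact loss).

    For each query stage i, the matching exemplar stage i is the positive
    and every other exemplar stage is a negative.
    """
    losses = []
    for i, q in enumerate(v_q):
        logits = [cosine(q, e) / temperature for e in v_e]
        # Numerically stable log-sum-exp over all exemplar stages.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        # Negative log-probability of picking the matching stage.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Illustrative two-stage features: aligned stages versus swapped stages.
stages = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = stage_contrastive_loss(stages, stages)
loss_swapped = stage_contrastive_loss(stages, swapped)
```

In this toy setting the loss is close to zero when corresponding stages match and large when the stages are swapped, which mirrors the stated goal of keeping corresponding stages consistent while penalizing cross-stage confusion.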
