FineSkiing: A Fine-grained Benchmark for Skiing Action Quality Assessment
Yongji Zhang, Siqi Li, Yue Gao, Yu Jiang
TL;DR
This work introduces FineSkiing, the first fine-grained AQA dataset for aerial skiing with stage-wise scores and deduction annotations, enabling more interpretable and reliable action scoring. It also proposes JudgeMind, a stage-decoupled framework that segments each video into air, form, and landing, applies stage-specific feature extraction, and leverages a knowledge-based decoder to fuse deduction knowledge with action codes to predict stage scores. Experiments show state-of-the-art performance on FineSkiing and competitive results on FineDiving, with ablations highlighting the contributions of temporal segmentation, foreground/global features, and deduction knowledge. The combination of a rich, standards-aligned dataset and a stage-aware, knowledge-guided model advances robust, interpretable AQA suitable for professional judging contexts and broader temporal action assessment tasks.
Abstract
Action Quality Assessment (AQA), which aims to evaluate and score sports actions, has attracted widespread interest in recent years. Existing AQA methods primarily predict scores from features extracted over the entire video, which limits their interpretability and reliability. Existing AQA datasets also lack fine-grained score annotations, in particular sub-scores and deduction items. In this paper, we construct the first AQA dataset with fine-grained sub-score and deduction annotations for aerial skiing, which will be released as a new benchmark. To address the technical challenges, we propose a novel AQA method, named JudgeMind, which significantly enhances performance and reliability by simulating the judgment and scoring mindset of professional referees. Our method segments the input action video into stages and scores each stage separately to improve accuracy. We then propose a stage-aware feature enhancement and fusion module that strengthens the perception of stage-specific key regions and improves robustness to the visual changes caused by frequent camera viewpoint switching. In addition, we propose a knowledge-based grade-aware decoder that incorporates possible deduction items as prior knowledge to predict more accurate and reliable scores. Experimental results demonstrate that our method achieves state-of-the-art performance.
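The stage-decoupled idea described above (segment the video into stages, then score each stage) can be illustrated with a minimal sketch. This is not the JudgeMind implementation: the function names, the two-boundary representation of the segmentation, the placeholder per-stage scorers, and the choice to sum stage scores into a total are all assumptions made for illustration only.

```python
from typing import Callable, Dict, Sequence, Tuple

# Stage names follow the air / form / landing split mentioned in the TL;DR.
STAGES = ("air", "form", "landing")


def split_into_stages(
    frames: Sequence, boundaries: Tuple[int, int]
) -> Dict[str, Sequence]:
    """Split a frame sequence into three contiguous stages.

    `boundaries` holds two frame indices (hypothetical representation):
    the first frame of the form stage and the first frame of the landing stage.
    """
    b1, b2 = boundaries
    if not 0 <= b1 <= b2 <= len(frames):
        raise ValueError("boundaries must satisfy 0 <= b1 <= b2 <= len(frames)")
    return {"air": frames[:b1], "form": frames[b1:b2], "landing": frames[b2:]}


def score_video(
    frames: Sequence,
    boundaries: Tuple[int, int],
    stage_scorers: Dict[str, Callable[[Sequence], float]],
) -> Tuple[Dict[str, float], float]:
    """Score each stage with its own scorer and return (stage scores, total).

    Summing the stage scores into a total is an assumption of this sketch.
    """
    segments = split_into_stages(frames, boundaries)
    stage_scores = {name: stage_scorers[name](segments[name]) for name in STAGES}
    return stage_scores, sum(stage_scores.values())


if __name__ == "__main__":
    # Dummy data: 10 "frames" and scorers that just count frames per stage.
    dummy_frames = list(range(10))
    dummy_scorers = {name: (lambda seg: float(len(seg))) for name in STAGES}
    print(score_video(dummy_frames, (3, 7), dummy_scorers))
```

In a real system each placeholder scorer would be replaced by a learned, stage-specific model, and the boundaries would come from a temporal segmentation module rather than being supplied by hand.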
