1st Place Solution to the 1st SkatingVerse Challenge
Tao Sun, Yuanzi Fu, Kaicheng Yang, Jian Wu, Ziyong Feng
TL;DR
This work tackles fine-grained action recognition in long, continuous figure skating videos. It combines DINO-based ROI cropping with three model streams (Unmasked Teacher, UniformerV2, and ViTPose-InfoGCN) that together exploit visual and skeletal cues. A logit-based ensemble reaches a leaderboard score of 95.73%, outperforming the single-model baselines and demonstrating the effectiveness of ROI preprocessing, multi-modal pretraining, and skeleton-based features for sports video analysis. The approach highlights the practical potential of combining diverse video foundation models with pose information for robust action recognition in complex, real-world datasets.
Abstract
This paper presents the winning solution for the 1st SkatingVerse Challenge. Our method proceeds in three steps. First, we use the DINO framework to extract the Region of Interest (ROI) and crop the raw video footage accordingly. Next, we employ three distinct models, namely Unmasked Teacher, UniformerV2, and InfoGCN, to capture different aspects of the data. Finally, we ensemble the models' predictions at the logit level, attaining a leaderboard score of 95.73%.
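
To make the logit-level ensembling step concrete, the following is a minimal sketch, not the authors' implementation. It assumes each model outputs a logit array of shape (num_clips, num_classes) over the same clips and class ordering. The function name `ensemble_logits` and the equal default weights are hypothetical, since the abstract does not state how the three models are weighted.

```python
import numpy as np


def ensemble_logits(logits_per_model, weights=None):
    """Fuse per-model logits by a weighted sum and pick the top class.

    logits_per_model: list of arrays, each (num_clips, num_classes).
    weights: optional per-model weights; equal weights are an assumption,
             not something the source specifies.
    """
    # Stack into (num_models, num_clips, num_classes) as floats.
    stacked = np.stack(logits_per_model, axis=0).astype(np.float64)
    if weights is None:
        weights = np.full(stacked.shape[0], 1.0 / stacked.shape[0])
    weights = np.asarray(weights, dtype=np.float64)
    # Contract the model axis: result has shape (num_clips, num_classes).
    fused = np.tensordot(weights, stacked, axes=1)
    return fused.argmax(axis=1), fused


# Toy logits standing in for the three streams (values are made up).
umt = np.array([[2.0, 0.0, 0.0]])
uniformer = np.array([[0.0, 1.5, 0.0]])
infogcn = np.array([[0.0, 1.0, 0.0]])
pred, fused = ensemble_logits([umt, uniformer, infogcn])
```

Summing logits rather than hard votes lets a confident model outweigh two uncertain ones. In practice the weights would presumably be tuned on a validation split.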
