Table of Contents
Fetching ...

1st Place Solution to the 1st SkatingVerse Challenge

Tao Sun, Yuanzi Fu, Kaicheng Yang, Jian Wu, Ziyong Feng

TL;DR

This work tackles fine-grained action recognition in long, continuous figure skating videos. It combines ROI cropping via DINO with three model streams—Unmasked Teacher, UniformerV2, and ViTPose-InfoGCN—leveraging both visual and skeletal cues. A logit-based ensemble yields a top leaderboard score of $95.73$, outperforming single-model baselines and demonstrating the effectiveness of ROI preprocessing, multi-modal pretraining, and skeleton-based features for sports video analysis. The approach highlights the practical potential of integrating diverse video foundations and pose information for robust action recognition in complex, real-world datasets.

Abstract

This paper presents the winning solution for the 1st SkatingVerse Challenge. We propose a method that involves several steps. To begin, we leverage the DINO framework to extract the Region of Interest (ROI) and perform precise cropping of the raw video footage. Subsequently, we employ three distinct models, namely Unmasked Teacher, UniformerV2, and InfoGCN, to capture different aspects of the data. By ensembling the prediction results based on logits, our solution attains an impressive leaderboard score of 95.73%.

1st Place Solution to the 1st SkatingVerse Challenge

TL;DR

This work tackles fine-grained action recognition in long, continuous figure skating videos. It combines ROI cropping via DINO with three model streams—Unmasked Teacher, UniformerV2, and ViTPose-InfoGCN—leveraging both visual and skeletal cues. A logit-based ensemble yields a top leaderboard score of , outperforming single-model baselines and demonstrating the effectiveness of ROI preprocessing, multi-modal pretraining, and skeleton-based features for sports video analysis. The approach highlights the practical potential of integrating diverse video foundations and pose information for robust action recognition in complex, real-world datasets.

Abstract

This paper presents the winning solution for the 1st SkatingVerse Challenge. We propose a method that involves several steps. To begin, we leverage the DINO framework to extract the Region of Interest (ROI) and perform precise cropping of the raw video footage. Subsequently, we employ three distinct models, namely Unmasked Teacher, UniformerV2, and InfoGCN, to capture different aspects of the data. By ensembling the prediction results based on logits, our solution attains an impressive leaderboard score of 95.73%.
Paper Structure (12 sections, 1 equation, 2 figures, 3 tables)

This paper contains 12 sections, 1 equation, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Examples of figure skating actions: A (Axel); Lo (Loop); F (Flip); CSp (Camel Spin); Lz (Lutz); NB (No Basic).
  • Figure 2: The overall architecture of our solution.