Fine-grained Action Analysis: A Multi-modality and Multi-task Dataset of Figure Skating
Sheng-Lan Liu, Yu-Ning Ding, Gang Yan, Si-Fan Zhang, Jin-Rong Zhang, Wen-Yue Chen, Xue-Hai Xu
TL;DR
MMFS addresses the gap in fine-grained action analysis by introducing a large-scale, multi-modality, multi-task dataset for figure skating. It combines RGB and skeleton data with independently defined spatial and temporal labels, and supports both action recognition and action quality assessment, leveraging a strong, expert-informed annotation workflow. The dataset features 11671 clips across 256 fine-grained categories (pared down to MMFS-63 for balanced evaluation) with long, variable durations and a hierarchical labeling scheme, challenging current models, especially in temporal analysis. Overall, MMFS demonstrates that skeleton-based representations better capture fine-grained motion cues and that temporal semantics pose substantial challenges, providing a rigorous benchmark to spur advances in multi-modality, fine-grained action analysis and quality assessment in sports.
Abstract
Fine-grained action analysis on existing action datasets is limited by insufficient action categories, coarse granularity, and limited modalities and tasks. In this paper, we propose a Multi-modality and Multi-task dataset of Figure Skating (MMFS), which was collected from the World Figure Skating Championships. MMFS, which supports action recognition and action quality assessment, provides RGB and skeleton modalities together with action scores for 11671 clips in 256 categories, including spatial and temporal labels. The key contributions of our dataset fall into three aspects. (1) Independent spatial and temporal categories are proposed for the first time to further explore fine-grained action recognition and quality assessment. (2) MMFS is the first to introduce the skeleton modality for complex fine-grained action quality assessment. (3) Our multi-modality and multi-task dataset encourages the development of more action analysis models. To benchmark our dataset, we adopt RGB-based and skeleton-based baseline methods for action recognition and action quality assessment.
