FACTS: Fine-Grained Action Classification for Tactical Sports
Christopher Lai, Jason Mo, Haotian Xia, Yuan-fang Wang
TL;DR
This work tackles fine-grained action recognition in fast-paced close-combat sports by introducing FACTS, a transformer-based framework that operates directly on raw video data without pose estimation or body markers. Leveraging a VideoMAE-pretrained encoder-decoder architecture, FACTS achieves state-of-the-art accuracies of 90% for fencing and 83.25% for boxing, and it introduces publicly available datasets with detailed, clipped action annotations. The approach demonstrates robust spatiotemporal modeling of subtle movements and sets a new benchmark for sensor-free, high-speed sport analytics. The work also discusses limitations and avenues for future improvement, including hybridizing with pose-based cues and extending to real-time deployment and additional sports.
Abstract
Classifying fine-grained actions in fast-paced, close-combat sports such as fencing and boxing presents unique challenges due to the complexity, speed, and nuance of the movements. Traditional methods that rely on pose estimation or specialized sensor data often struggle to capture these dynamics accurately. We introduce FACTS, a novel transformer-based approach for fine-grained action recognition that processes raw video data directly, eliminating the need for pose estimation and for cumbersome body markers and sensors. FACTS achieves state-of-the-art performance, with 90% accuracy on fencing actions and 83.25% on boxing actions. Additionally, we present a new publicly available dataset featuring 8 detailed fencing actions, addressing critical gaps in sports analytics resources. Our findings can enhance training, performance analysis, and spectator engagement, and they set a new benchmark for action classification in tactical sports.
