FERA: A Pose-Based Semantic Pipeline for Automated Foil Fencing Refereeing
Ziwen Chen, Zhong Wang
TL;DR
This work tackles the challenge of turning broadcast foil fencing video into semantically grounded decisions by bridging pose-based perception with rule-based reasoning. It introduces FERA, a two-stage pipeline: FERA-MDT, a calibrated encoder-only transformer that maps a 101-dimensional pose-kinematic representation to multi-label footwork and blade-line predictions, and FERA-LM, a lightweight language model that treats these predictions as tokens and reasons over them to output textual referee decisions. The authors provide a dedicated dataset with frame-level annotations for 12 moves and 5 blade-line positions. FERA-MDT reaches a macro-F1 of $0.549 \pm 0.018$ on move and blade recognition, outperforming BiLSTM and TCN baselines, and the full pipeline recovers referee priority with $77.7\%$ end-to-end accuracy. The work offers a reusable pattern for pose-based semantic grounding in two-person sports and highlights practical steps toward real-time decision support, including dynamic windowing, left-right canonicalization, and rule-grounded language reasoning.
Abstract
Many multimedia tasks map raw video into structured semantic representations for downstream decision-making. Sports officiating is a representative case, where fast, subtle interactions must be judged via symbolic rules. We present FERA (FEncing Referee Assistant), a pose-based framework that turns broadcast foil fencing video into action tokens and rule-grounded explanations. From monocular footage, FERA extracts 2D poses, converts them into a 101-dimensional kinematic representation, and applies an encoder-only transformer (FERA-MDT) to recognize per-fencer footwork, blade actions, and blade-line position. To obtain a consistent single-fencer representation for both athletes, FERA processes each clip and a horizontally flipped copy, yielding time-aligned left/right predictions without requiring a multi-person pose pipeline. A dynamic temporal windowing scheme enables inference on untrimmed pose tracks. These structured predictions serve as tokens for a language model (FERA-LM) that applies simplified right-of-way rules to generate textual decisions. On 1,734 clips (2,386 annotated actions), FERA-MDT achieves a macro-F1 of 0.549 under 5-fold cross-validation, outperforming BiLSTM and TCN baselines. Combined with FERA-LM, the full pipeline recovers referee priority with 77.7% accuracy on 969 exchanges. FERA provides a case-study benchmark for pose-based semantic grounding in a two-person sport and illustrates a general pipeline for connecting video understanding with rule-based reasoning.
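For readers unfamiliar with the headline metric, the following is a minimal sketch of how a macro-F1 score can be computed for multi-label predictions: F1 is computed per label and then averaged with equal weight, so rare labels count as much as frequent ones. It is not the authors' evaluation code. The function name `macro_f1`, the binary indicator-row input format, and the convention of scoring a label with no true or predicted positives as 0 are all assumptions made for illustration.

```python
from typing import Sequence


def macro_f1(y_true: Sequence[Sequence[int]],
             y_pred: Sequence[Sequence[int]]) -> float:
    """Unweighted mean of per-label F1 over binary multi-label rows.

    Each row is one sample; each column is one label (1 = present).
    Assumption: a label with no true and no predicted positives scores 0.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same number of rows")
    if not y_true:
        raise ValueError("at least one sample is required")

    n_labels = len(y_true[0])
    per_label = []
    for k in range(n_labels):
        # Count true positives, false positives and false negatives for label k.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 0 and p[k] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[k] == 1 and p[k] == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0.
        per_label.append(2 * tp / denom if denom else 0.0)

    return sum(per_label) / n_labels


if __name__ == "__main__":
    # Toy data: 4 samples, 3 hypothetical labels (not from the FERA dataset).
    truth = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
    preds = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 1, 1]]
    print(f"macro-F1 = {macro_f1(truth, preds):.3f}")
```

In the toy data the three per-label F1 scores are 1, 1/2, and 2/3, and their unweighted mean is 13/18. Under cross-validation, a score of this kind would typically be computed per fold and then summarized across folds.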
