ViTs for Action Classification in Videos: An Approach to Risky Tackle Detection in American Football Practice Videos

Syed Ahsan Masud Zaidi, William Hsu, Scott Dietrich

Abstract

Early identification of hazardous actions in contact sports enables timely intervention and improves player safety. We present a method for detecting risky tackles in American football practice videos and introduce a substantially expanded dataset for this task. Our dataset contains 733 single-athlete dummy-tackle clips, each temporally localized around the first point of contact and labeled with the strike-zone component of the standardized Assessment for Tackling Technique (SATT-3), extending prior work that reported 178 annotated videos. Using a Vision Transformer-based model with imbalance-aware training, we obtain a risky recall of 0.67 and a risky F1 of 0.59 under cross-validation. Relative to the previous baseline on a smaller subset (risky recall 0.58; risky F1 0.56), our approach improves risky recall by more than 8 percentage points on a much larger dataset. These results indicate that Vision Transformer-based video analysis, coupled with careful handling of class imbalance, can reliably detect rare but safety-critical tackling patterns, offering a practical pathway toward coach-centered injury-prevention tools.

Paper Structure

This paper contains 15 sections, 5 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Sample original (left) vs. augmented (right) frames for Run 8.
  • Figure 2: Temporal localization: Extracting tackle event from raw videos.
  • Figure 3: Pipeline overview: first-contact localization with fixed-window trimming, Taguchi $L_{18}$-guided augmentation, stratified 5-fold cross-validation, and ViViT training for Risky Tackle Detection
  • Figure 4: Performance heatmap showing mean scores across 5-fold cross-validation for all Taguchi configurations and supplementary runs. Rows correspond to evaluation metrics and columns to experimental runs. Cell values represent mean scores (0-1) computed using per-fold operating thresholds selected to maximize macro-F1. Black boxes highlight the best-performing configuration for each metric. Run 15 achieves optimal performance on both critical metrics: risky recall (0.67) and risky F1 (0.59).
  • Figure 5: Detailed performance comparison across evaluation metrics for selected augmentation configurations. Run 15 (highlighted) demonstrates superior risky-class detection (recall = 0.67, F1 = 0.59) while maintaining balanced performance across metrics. The supplementary runs (original imbalanced and duplicated baseline) serve as reference points, illustrating the effectiveness of systematic augmentation design over naive class balancing.
  • ...and 1 more figure
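The Figure 4 caption mentions per-fold operating thresholds selected to maximize macro-F1. The paper does not give the selection procedure itself, but a minimal sketch of such a threshold sweep, with an illustrative grid and hand-rolled macro-F1 (all names and the grid spacing are assumptions, not from the paper), could look like:

```python
import numpy as np

def f1(y_true, y_pred, positive):
    """Per-class F1 for the given positive label (0 = safe, 1 = risky)."""
    tp = np.sum((y_pred == positive) & (y_true == positive))
    fp = np.sum((y_pred == positive) & (y_true != positive))
    fn = np.sum((y_pred != positive) & (y_true == positive))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of the safe-class and risky-class F1 scores."""
    return 0.5 * (f1(y_true, y_pred, 0) + f1(y_true, y_pred, 1))

def select_threshold(y_true, risky_scores, grid=np.linspace(0.05, 0.95, 19)):
    """Sweep a threshold grid on one validation fold; keep the macro-F1 maximizer.

    `risky_scores` are the model's predicted probabilities for the risky class.
    Ties are broken toward the larger threshold.
    """
    best_f1, best_t = max(
        (macro_f1(y_true, (risky_scores >= t).astype(int)), t) for t in grid
    )
    return best_t, best_f1
```

In a stratified 5-fold setup, this selection would run independently on each fold's validation split, and the reported metrics would use each fold's own operating threshold rather than a fixed 0.5 cutoff.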