See It Before You Grab It: Deep Learning-based Action Anticipation in Basketball

Arnau Barrera Roy, Albert Clapés Sintes

TL;DR

This paper introduces action anticipation in basketball by predicting which team will secure possession after a missed shot, leveraging a self-curated NBA Rebounds dataset with over 100,000 videos and more than 2,000 timestamp-annotated rebounds. It proposes a Transformer-Encoder Anticipation Model (TEAM), built on an X3D_m backbone, for online anticipation and compares it to a strong baseline, while also exploring the auxiliary tasks of action classification and action spotting. The study demonstrates the feasibility and challenges of anticipating rebounds, provides extensive offline and online experiments, and analyzes human versus AI performance, interpretability, and data-augmentation strategies. The results offer insights into predictive modeling for dynamic multi-agent sports and pave the way for real-time broadcasting tools and post-game analysis, while highlighting avenues for future improvement such as ball-tracking cues and uncertainty-aware heads.

Abstract

Computer vision and video understanding have transformed sports analytics by enabling large-scale, automated analysis of game dynamics from broadcast footage. Despite significant advances in player and ball tracking, pose estimation, action localization, and automatic foul recognition, anticipating actions before they occur in sports videos has received comparatively little attention. This work introduces the task of action anticipation in basketball broadcast videos, focusing on predicting which team will gain possession of the ball following a shot attempt. To benchmark this task, a new self-curated dataset comprising 100,000 basketball video clips, over 300 hours of footage, and more than 2,000 manually annotated rebound events is presented. Comprehensive baseline results are reported using state-of-the-art action anticipation methods, representing the first application of deep learning techniques to basketball rebound prediction. Additionally, two complementary tasks, rebound classification and rebound spotting, are explored, demonstrating that this dataset supports a wide range of video understanding applications in basketball, for which no comparable datasets currently exist. Experimental results highlight both the feasibility and inherent challenges of anticipating rebounds, providing valuable insights into predictive modeling for dynamic multi-agent sports scenarios. By forecasting team possession before rebounds occur, this work enables applications in real-time automated broadcasting and post-game analysis tools to support decision-making.

Paper Structure

This paper contains 82 sections, 5 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Example sequences of the two studied human actions: a) defensive rebound and b) offensive rebound.
  • Figure 2: Illustration of the different tasks solved in this project: a) Action Classification, b) Action Spotting, c) Offline Action Anticipation and d) Online Action Anticipation. The notation used in this figure ($V$, $t_A$, $\tau_a$, etc.) is explained in detail in the problem definition section.
  • Figure 3: Dataset $D_{100\text{K}}$ statistics: a) histogram of video durations across the whole dataset, b) histogram of the absolute timestamps of the actions within the videos and c) histogram of the action timestamps relative to the whole video duration.
  • Figure 4: Schematic view of the method TEAM, showing the dimensions of the input and output tensors for each module. The input to the model consists of a batch of videos, from which a 3D CNN backbone extracts local spatiotemporal features. Positional encoding (PE) is then added to these features, and a CLS token is appended. This tensor is subsequently passed through a double Transformer encoder layer, which enriches the CLS token. Finally, the enriched CLS token is fed into an MLP to produce the final prediction, represented as a per-class probability.
  • Figure 5: Illustration of different anticipation times ($\tau_a$) for the same play (OREB). Each subfigure shows the last frame that the model would see when trained at each $\tau_a$.
  • ...and 13 more figures
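
The role of the anticipation time $\tau_a$ described in the captions of Figures 2 and 5 can be illustrated with a small sketch. It assumes that $t_A$ is the annotated action timestamp in seconds and that the model observes a fixed-length clip ending $\tau_a$ seconds before $t_A$. The function name `observed_window`, the `fps` and `clip_len` parameters, and the example values are hypothetical and are not taken from the paper, whose exact definitions are given in its problem definition section.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservedWindow:
    first_frame: int  # index of the first frame shown to the model
    last_frame: int   # index of the last frame shown to the model (inclusive)


def observed_window(t_action: float, tau_a: float, fps: float, clip_len: int) -> ObservedWindow:
    """Return the frame range visible to the model for a given anticipation time.

    Assumption (not from the paper): the observation ends exactly tau_a seconds
    before the action timestamp and spans at most clip_len frames.
    """
    if tau_a < 0:
        raise ValueError("tau_a must be non-negative")
    if clip_len < 1:
        raise ValueError("clip_len must be at least 1")
    last_time = t_action - tau_a
    if last_time < 0:
        raise ValueError("anticipation time exceeds the action timestamp")
    # Last frame the model is allowed to see.
    last_frame = math.floor(last_time * fps)
    # Clamp at the start of the video if the clip would begin before frame 0.
    first_frame = max(0, last_frame - clip_len + 1)
    return ObservedWindow(first_frame, last_frame)


if __name__ == "__main__":
    # Hypothetical rebound at t_A = 6.0 s in a 30 fps video, 16-frame clips.
    for tau in (0.0, 0.5, 1.0, 2.0):
        print(tau, observed_window(6.0, tau, fps=30, clip_len=16))
```

Under these assumptions, a larger $\tau_a$ moves the observed window earlier in the play, so the model sees less of the shot trajectory before it must predict the rebounding team. This matches the trade-off that Figure 5 depicts with its per-$\tau_a$ last frames.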