Foul prediction with estimated poses from soccer broadcast video

Jiale Fang, Calvin Yeung, Keisuke Fujii

TL;DR

This paper addresses predicting soccer fouls from broadcast video by fusing spatial-temporal cues from video, bounding boxes, bounding-box images, and pose information. The authors introduce FutureFoul, a four-branch CNN/GRU-based architecture that processes each modality and combines them via an MLP to foresee a foul one second ahead, using a dataset built from SoccerNet-v3 with 2,500 fouls and 2,500 non-fouls. They show that the full multi-modal model outperforms ablations and that pose and bbox inputs contribute to performance, though recall remains challenging. The work advances practical foul prediction for refereeing and safety, while highlighting data quality and pose-detection limitations and pointing to future improvements in datasets, tracking, and pose estimation.
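The "one second ahead" setup can be illustrated with a small frame-indexing sketch. It assumes a 25 fps clip and a 3 s observation window, which is consistent with the figure captions (frames 1 to 75 span the 3 s before the excluded final second) but is not stated as the authors' implementation. The function name `observation_window` and its parameters are hypothetical.

```python
def observation_window(foul_frame: int, fps: int = 25,
                       observe_s: float = 3.0, gap_s: float = 1.0) -> range:
    """Return the 0-based frame indices a predictor may see.

    The window ends `gap_s` seconds before `foul_frame`, so frames from the
    final second before the foul are never used as input.
    NOTE: fps=25, observe_s=3.0 and gap_s=1.0 are assumptions for illustration.
    """
    gap = round(gap_s * fps)         # frames withheld before the foul
    length = round(observe_s * fps)  # frames the model observes
    end = foul_frame - gap           # exclusive end of the visible window
    start = end - length
    if start < 0:
        raise ValueError("clip is too short for the requested window")
    return range(start, end)


# A foul annotated at frame index 100 yields 75 visible frames (indices 0..74).
window = observation_window(100)
print(len(window), window[0], window[-1])
```

Keeping the gap explicit makes it harder to accidentally feed the model frames in which the foul is already under way.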

Abstract

Recent advances in computer vision have brought significant progress in tracking and pose estimation of sports players. However, there have been fewer studies on behavior prediction with pose estimation in sports. In particular, predicting soccer fouls is challenging because each player occupies only a small image region and because information such as the ball and player poses is difficult to use. In our research, we introduce a deep learning approach for anticipating soccer fouls. This method integrates video data, bounding box positions, bounding box images, and pose information, drawn from a novel soccer foul dataset that we curated. Our model uses a combination of convolutional and recurrent neural networks (CNNs and RNNs) to merge information from these four modalities. The experimental results show that our full model outperformed the ablated models, and that the RNN modules, the bounding box positions and images, and the estimated poses were all useful for foul prediction. Our findings have important implications for a deeper understanding of foul play in soccer and provide a valuable reference for future research and practice in this area.

Paper Structure (23 sections, 9 figures, 3 tables)

This paper contains 23 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Our foul prediction system (FutureFoul). Our method uses 3 s of video, bbox, bbox image, and pose information to predict a foul 1 s ahead.
  • Figure 2: Video data example. Frames 1, 25, 50, and 75 are time-ordered frames from the 3 s before the foul. Frames 85 and 95 are examples of a foul happening within 1 s of the denoted foul time; such frames are not used for training our FutureFoul model.
  • Figure 3: Bbox extraction. The bbox information extracted using ByteTrack [zhang2022bytetrack] (Before) was filtered to keep the five bboxes closest to the position of the soccer ball at the time of the foul (After).
  • Figure 4: Pose extraction. We used OCHuman [zhang2019pose2seg] to obtain pose data (After) corresponding to the previously obtained bbox information (Before).
  • Figure 5: Bounding box image extraction. We obtain bbox image data (After) from the video data (Before).
  • ...and 4 more figures
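
The filtering step described for Figure 3 (keeping the five bboxes closest to the ball) could look roughly like the sketch below. It is an illustration under assumptions, not the authors' code: distance is measured from each bbox centre to the ball position, bboxes are `(track_id, x1, y1, x2, y2)` tuples in pixel coordinates, and the name `closest_bboxes` is hypothetical.

```python
import math


def closest_bboxes(bboxes, ball_xy, k=5):
    """Keep the k bboxes whose centres are nearest to the ball.

    bboxes:  list of (track_id, x1, y1, x2, y2) tuples (assumed layout).
    ball_xy: (x, y) ball position at the foul time.
    NOTE: using the bbox centre as the reference point is an assumption.
    """
    def distance_to_ball(bbox):
        _, x1, y1, x2, y2 = bbox
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        return math.hypot(cx - ball_xy[0], cy - ball_xy[1])

    # sorted() is stable, so ties keep their original tracker order.
    return sorted(bboxes, key=distance_to_ball)[:k]


# Seven tracked players spaced along the x-axis; the ball sits near player 3.
tracks = [(i, 100 * i, 0, 100 * i + 20, 40) for i in range(7)]
kept = closest_bboxes(tracks, ball_xy=(310, 20))
print([track_id for track_id, *_ in kept])
```

Restricting the input to a few players near the ball keeps the per-player branches small; other selection rules (for example, distance to the bbox bottom edge) would fit the same interface.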