From Broadcast to Minimap: Achieving State-of-the-Art SoccerNet Game State Reconstruction

Vladimir Golovkin, Nikolay Nemtsev, Vasyl Shandyba, Oleg Udin, Nikita Kasatkin, Pavel Kononov, Anton Afanasiev, Sergey Ulasen, Andrei Boiarov

TL;DR

The paper tackles the problem of reconstructing accurate game-state information from single-camera football broadcasts, including player positions, roles, teams, and jersey numbers, and presents a modular pipeline that outputs world-coordinate trajectories suitable for minimap representations. It combines a fine-tuned object detector (YOLOv5m), a SegFormer-based camera parameter estimator with field-keypoint refinement, and a DeepSORT-based tracker augmented with ReID, TeamID, and jersey-number recognition, followed by a multi-stage post-processing step that merges fragmented tracklets. Key contributions include the SegFormer-based camera parameter regression with keypoint-based refinement, a five-cluster team-detection scheme, a robust post-processing pipeline that significantly reduces tracklet fragmentation, and real-time performance on consumer hardware, culminating in a GS-HOTA score of 63.81 and first place in SoccerNet GSR 2024. The work demonstrates strong gains from integrated detection, localization, and identity modeling, enabling reliable minimap-based game state reconstruction with practical implications for coaching analytics and tactical decision-making in football.

Abstract

Game State Reconstruction (GSR), a critical task in Sports Video Understanding, involves precise tracking and localization of all individuals on the football field (players, goalkeepers, referees, and others) in real-world coordinates. This capability enables coaches and analysts to derive actionable insights into player movements, team formations, and game dynamics, ultimately optimizing training strategies and enhancing competitive advantage. Achieving accurate GSR with a single camera is highly challenging due to frequent camera movements, occlusions, and dynamic scene content. In this work, we present a robust end-to-end pipeline for tracking players across an entire match using a single-camera setup. Our solution integrates a fine-tuned YOLOv5m for object detection, a SegFormer-based camera parameter estimator, and a DeepSORT-based tracking framework enhanced with re-identification, orientation prediction, and jersey number recognition. By ensuring both spatial accuracy and temporal consistency, our method delivers state-of-the-art game state reconstruction, securing first place in the SoccerNet Game State Reconstruction Challenge 2024 and significantly outperforming competing methods.

Paper Structure

This paper contains 20 sections, 8 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: The overall pipeline is divided into three main stages. In the raw tracking stage, initial tracks are generated, and information about team embeddings and jersey numbers is estimated for each player. During the team detection stage, the previously collected information is used to assign player tracks to their respective teams. Finally, postprocessing is applied to reduce the number of resulting tracks by merging raw tracking results.
  • Figure 2: The raw tracking stage performs object detection and pitch localization, collects the player team information required by subsequent stages together with Re-ID embeddings and jersey numbers, and then merges all collected data into preliminary object tracks using the DeepSORT-based tracker.
  • Figure 3: Team Detection Process. (a) Frames are clustered into three main groups: the two largest clusters (left and right teams) and the referee cluster. (b) Goalkeeper detection is performed separately by identifying athletes inside the penalty area and clustering them based on embeddings.
  • Figure 4: Camera Parameters Model. This figure illustrates the architecture of our custom SegFormer-based camera parameter estimator. The model consists of an encoder-decoder structure, where the encoder is based on the SegFormer architecture and the decoder includes two heads: one for predicting camera parameters (position, orientation, and field of view) and another for generating UV heatmaps.
  • Figure 5: The pipeline estimates camera parameters by combining a custom SegFormer model for initial predictions and a ResNet50-based segmentation for keypoint detection. The parameters are refined using keypoint alignment to obtain the final camera pose.
  • ...and 3 more figures
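
The captions for Figures 4 and 5 describe a camera model parameterized by position, orientation, and field of view. As a rough illustration of how such parameters could turn a detection in the image into a minimap position, the sketch below intersects the viewing ray of a pixel with the pitch plane. Everything here is an assumption made for illustration and is not taken from the paper: an ideal pinhole camera with pan and tilt only (no roll, no lens distortion), a world frame with z pointing up and the pitch at z = 0, and the hypothetical function names `camera_axes`, `pixel_to_pitch`, and `pitch_to_pixel`.

```python
import math


def camera_axes(pan, tilt):
    """Return the camera's right, down and forward unit vectors in world coordinates.

    Assumed convention (not from the paper): world z points up, pan is measured
    from the world +y axis, tilt is positive when the camera looks down, no roll.
    """
    forward = (math.sin(pan) * math.cos(tilt),
               math.cos(pan) * math.cos(tilt),
               -math.sin(tilt))
    right = (math.cos(pan), -math.sin(pan), 0.0)
    # down = forward x right, which completes a right-handed (right, down, forward) frame
    down = (forward[1] * right[2] - forward[2] * right[1],
            forward[2] * right[0] - forward[0] * right[2],
            forward[0] * right[1] - forward[1] * right[0])
    return right, down, forward


def focal_length_px(hfov, width):
    """Focal length in pixels for a given horizontal field of view (radians)."""
    return (width / 2.0) / math.tan(hfov / 2.0)


def pixel_to_pitch(u, v, cam_pos, pan, tilt, hfov, width, height):
    """Intersect the viewing ray of pixel (u, v) with the pitch plane z = 0.

    Returns (x, y) on the pitch, or None if the ray never reaches the ground.
    """
    right, down, forward = camera_axes(pan, tilt)
    f = focal_length_px(hfov, width)
    a = (u - width / 2.0) / f   # normalized horizontal image coordinate
    b = (v - height / 2.0) / f  # normalized vertical image coordinate
    ray = tuple(a * right[i] + b * down[i] + forward[i] for i in range(3))
    if ray[2] >= 0.0:           # ray points at or above the horizon
        return None
    t = -cam_pos[2] / ray[2]    # scale at which the ray reaches z = 0
    return (cam_pos[0] + t * ray[0], cam_pos[1] + t * ray[1])


def pitch_to_pixel(x, y, cam_pos, pan, tilt, hfov, width, height):
    """Project a pitch point (x, y, 0) into the image; None if behind the camera."""
    right, down, forward = camera_axes(pan, tilt)
    f = focal_length_px(hfov, width)
    rel = (x - cam_pos[0], y - cam_pos[1], -cam_pos[2])
    xc = sum(rel[i] * right[i] for i in range(3))
    yc = sum(rel[i] * down[i] for i in range(3))
    zc = sum(rel[i] * forward[i] for i in range(3))
    if zc <= 0.0:
        return None
    return (width / 2.0 + f * xc / zc, height / 2.0 + f * yc / zc)


if __name__ == "__main__":
    # Hypothetical camera 20 m high, 50 m behind the pitch origin, aimed at the origin.
    cam = (0.0, -50.0, 20.0)
    tilt = math.atan2(20.0, 50.0)
    print(pixel_to_pitch(960, 540, cam, 0.0, tilt, math.radians(60), 1920, 1080))
```

In a tracking pipeline the pixel passed to `pixel_to_pitch` would typically be the bottom-center of a player's bounding box, since that point is the one most plausibly lying on the ground plane. With the hypothetical camera in the example, the image center maps to the pitch origin up to floating-point error, which is a quick sanity check on the sign conventions.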