Simulating Tracking Data to Advance Sports Analytics Research
David Radke, Kyle Tilbury
TL;DR
This paper addresses the scarcity of publicly accessible high-resolution tracking data in invasion sports by introducing a pipeline that generates synthetic tracking data from the Google Research Football environment. It defines a headless 3D-state schema mirroring real tracking data, stores 23 entities per timestep, and provides tooling to extract events and stints. The authors demonstrate two modeling tasks on the simulated data: expected goals ($xG$) estimation from shot features and a tracking-based pitch control model, illustrating that analytics designed for real tracking data can be developed on synthetic data. They release a publicly accessible dataset of 3,000 simulated games, accompanying code, and a demo video, offering a practical path for AI and sports analytics research when public data are scarce.
Abstract
Advanced analytics have transformed how sports teams operate, particularly in episodic sports like baseball. However, their impact on continuous invasion sports, such as soccer and ice hockey, has been limited by greater game complexity and restricted access to high-resolution game tracking data. In this demo, we present a method to collect and utilize simulated soccer tracking data from the Google Research Football environment to support the development of models designed for continuous tracking data. The data is stored in a schema that is representative of real tracking data, and we provide processes that extract high-level features and events. We include examples of established tracking data models to showcase the efficacy of the simulated data. In doing so, we address the scarcity of publicly available tracking data and support research at the intersection of artificial intelligence and sports analytics.
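
To make the idea of a tracking-data schema concrete, the following is a minimal sketch of how one timestep with 23 entities might be represented. It is not the authors' released schema: the names `EntityState`, `Frame`, and `make_frame`, the field layout, and the reading of 23 entities as 11 + 11 players plus the ball are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]

# Assumption: 23 entities per timestep means two teams of 11 players plus the ball.
PLAYERS_PER_TEAM = 11
ENTITIES_PER_FRAME = 2 * PLAYERS_PER_TEAM + 1


@dataclass(frozen=True)
class EntityState:
    """Hypothetical record for one tracked entity at one timestep."""
    entity_id: int   # illustrative convention: 0-10 home, 11-21 away, 22 ball
    team: str        # "home", "away", or "ball"
    position: Vec3   # (x, y, z) in the simulator's coordinate system


@dataclass(frozen=True)
class Frame:
    """Hypothetical record for one timestep of one simulated game."""
    game_id: int
    timestep: int
    entities: List[EntityState]

    def __post_init__(self) -> None:
        # Reject frames that do not carry exactly 23 entities.
        if len(self.entities) != ENTITIES_PER_FRAME:
            raise ValueError(
                f"expected {ENTITIES_PER_FRAME} entities, got {len(self.entities)}"
            )


def make_frame(
    game_id: int,
    timestep: int,
    home_xyz: Sequence[Vec3],
    away_xyz: Sequence[Vec3],
    ball_xyz: Vec3,
) -> Frame:
    """Assemble a Frame from per-team position lists and the ball position."""
    entities: List[EntityState] = []
    for i, pos in enumerate(home_xyz):
        entities.append(EntityState(i, "home", pos))
    for i, pos in enumerate(away_xyz):
        entities.append(EntityState(PLAYERS_PER_TEAM + i, "away", pos))
    entities.append(EntityState(2 * PLAYERS_PER_TEAM, "ball", ball_xyz))
    return Frame(game_id, timestep, entities)
```

Storing every timestep in one fixed-size record like this keeps downstream feature and event extraction independent of whether the frames came from a simulator or from a real tracking provider.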
