Forecasting Events in Soccer Matches Through Language

Tiago Mendes-Neves, Luís Meireles, João Mendes-Moreira

TL;DR

The paper tackles the problem of forecasting the next event in a soccer match by treating matches as sequences of events and drawing on Large Language Models to build a single, language-based Large Event Model (LEM). It uses the public WyScout dataset with ordinal-encoded event tokens and a token-by-token prediction paradigm, which enables end-to-end generation of entire event chains and scalable simulations for analytics pipelines. Experimental results show gains over previous LEM proposals in predicting the next event type and in spatial accuracy. The same model also supports situational xG maps, momentum-like short-term probabilities, and long-term match outcome forecasts, with VAEP valuations broadly aligning with expected scoring opportunities. The work offers a scalable backbone for diverse soccer analytics tasks and outlines avenues for future work, such as richer contextual inputs and more advanced architectures.

Abstract

This paper introduces an approach to predicting the next event in a soccer match, a challenge bearing remarkable similarities to the problem faced by Large Language Models (LLMs). Unlike other methods that severely limit event dynamics in soccer, often abstracting from many variables or relying on a mix of sequential models, our research proposes a novel technique inspired by the methodologies used in LLMs. These models predict a complete chain of variables that compose an event, significantly simplifying the construction of Large Event Models (LEMs) for soccer. Utilizing deep learning on the publicly available WyScout dataset, the proposed approach notably surpasses the performance of previous LEM proposals in critical areas, such as the prediction accuracy of the next event type. This paper highlights the utility of LEMs in various applications, including match prediction and analytics. Moreover, we show that LEMs provide a simulation backbone for users to build many analytics pipelines, an approach opposite to the current specialized single-purpose models. LEMs represent a pivotal advancement in soccer analytics, establishing a foundational framework for multifaceted analytics pipelines through a singular machine-learning model.
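
The idea of predicting "a complete chain of variables that compose an event" can be illustrated with a minimal sketch. It is not the authors' implementation: the field names, vocabulary sizes, and the uniform `toy_model` below are hypothetical stand-ins, chosen only to show how each sampled token is fed back into the context before the next variable is predicted.

```python
import random

# Hypothetical event layout: the variable names and vocabulary sizes are
# illustrative assumptions, not the encoding used in the paper.
EVENT_FIELDS = [("type", 5), ("x", 101), ("y", 101)]


def toy_model(context, vocab_size):
    """Stand-in for a trained LEM: returns one probability per token.

    This toy version is uniform; a real model would condition on `context`.
    """
    return [1.0 / vocab_size] * vocab_size


def generate_event(context, model=toy_model, rng=random):
    """Generate one event token by token, feeding each sampled token back in."""
    event = {}
    for name, vocab_size in EVENT_FIELDS:
        probs = model(context, vocab_size)
        token = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        event[name] = token
        context = context + [token]  # the next variable sees this token
    return event, context


def simulate(context, n_events, model=toy_model, rng=random):
    """Roll the model forward to produce a chain of `n_events` events."""
    events = []
    for _ in range(n_events):
        event, context = generate_event(context, model, rng)
        events.append(event)
    return events


if __name__ == "__main__":
    print(simulate([], n_events=3, rng=random.Random(0)))
```

Because a single model produces every variable of every event, the same loop can serve as the simulation backbone for different downstream analyses.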

Paper Structure (15 sections, 3 equations, 6 figures, 4 tables)


Figures (6)

  • Figure 1: A schematic representation of our proposal. In blue, we have the set of inputs used to build the input vector, passed through the LEM model to infer the probabilities of each token. To make a prediction, the probabilities go through a sampler with restrictions to avoid hallucinations, i.e., predicting unrealistic values.
  • Figure 2: The probability of transitioning from the current location (x, y) to the next location (x, y). The pattern shows two behaviors: (1) a positive correlation between the current and next coordinates, since the next event performed by the same team is expected to be close to the current event, and (2) a negative correlation that arises when the next event is performed by the opposing team, because the coordinate axes flip to the opposition's perspective.
  • Figure 3: Situational expected goals (xG) maps computed with the different models. For each case, we simulated 1,000,000 shots per input and then computed, for each location, the percentage of shots leading to a goal; these percentages are plotted in the figures.
  • Figure 4: A match momentum indicator built using the K=1 model. The data corresponds to the Real Madrid vs. Barcelona match of December 23, 2017.
  • Figure 5: In-game probabilities calculated using LEMs for the Real Madrid vs. Barcelona match of December 23, 2017. The second half starts with balanced probabilities, which shift abruptly every time Barcelona scores a goal. Short-term fluctuations caused by events in the match are also visible.
  • ...and 1 more figure
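
Two mechanisms in the captions above lend themselves to a short sketch: the sampler with restrictions from Figure 1 and the simulation-based goal percentage from Figure 3. The code below is an assumption-laden illustration, not the paper's code. It implements the restriction as masking disallowed tokens and renormalizing, which is one common way to do it. The names `restricted_sample`, `goal_fraction`, and `simulate_shot` are hypothetical.

```python
import random


def restricted_sample(probs, allowed, rng=random):
    """Sample a token index from `probs`, considering only indices in `allowed`.

    Disallowed tokens get zero weight and the rest are renormalized, so the
    sampler can never return a value marked as unrealistic.
    """
    masked = [p if i in allowed else 0.0 for i, p in enumerate(probs)]
    total = sum(masked)
    if total <= 0.0:
        raise ValueError("no allowed token has positive probability")
    weights = [p / total for p in masked]
    return rng.choices(range(len(probs)), weights=weights, k=1)[0]


def goal_fraction(simulate_shot, n_shots, rng):
    """Monte Carlo estimate: share of simulated shots that end in a goal.

    `simulate_shot(rng)` is a hypothetical callable returning True for a goal;
    in practice it would wrap a model rollout from a fixed input state.
    """
    goals = sum(1 for _ in range(n_shots) if simulate_shot(rng))
    return goals / n_shots


if __name__ == "__main__":
    rng = random.Random(1)
    # Token 0 is treated as invalid in this toy context, so it is never drawn.
    print(restricted_sample([0.5, 0.3, 0.2], allowed={1, 2}, rng=rng))
    # A toy shot model scoring 10% of the time, for illustration only.
    print(goal_fraction(lambda r: r.random() < 0.1, 10_000, rng))
```

Repeating `goal_fraction` over a grid of shot locations would yield a map of the kind described for Figure 3, with the estimate's noise shrinking as the number of simulated shots grows.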