Bayes-xG: Player and Position Correction on Expected Goals (xG) using Bayesian Hierarchical Approach
Alexander Scholtes, Oktay Karakuş
TL;DR
This work investigates whether player identity and playing position alter shot-to-goal probabilities beyond traditional shot features by applying Bayesian hierarchical logistic regression to football xG data. It compares a baseline frequentist xG model with a StatsBomb benchmark, then builds three Bayes-xG variants to separate position effects from player effects. Position effects largely disappear once detailed shot-context predictors are included, while player-specific xG adjustments persist. The study analyzes Premier League data and extends to La Liga and the Bundesliga, where it finds comparable results, and it examines how prior choices affect sampling efficiency. The results offer practical insights for scouting and performance evaluation by quantifying how individual players differ in finishing ability beyond contextual shot factors, and the paper discusses the methodological implications of prior choice in complex hierarchical models.
Abstract
This study employs Bayesian methodologies to explore the influence of player-level and positional factors on the probability of a shot resulting in a goal, as measured by the expected goals (xG) metric. Utilising publicly available data from StatsBomb, Bayesian hierarchical logistic regressions are constructed, analysing approximately 10,000 shots from the English Premier League to ascertain whether positional or player-level effects impact xG. The findings reveal positional effects in a basic model that includes only distance to goal and shot angle as predictors: strikers and attacking midfielders exhibit a higher likelihood of scoring. However, these effects diminish when more informative predictors are introduced. Player-level effects, by contrast, persist even with the additional predictors, indicating that certain players carry notable positive or negative xG adjustments that influence their likelihood of scoring a given chance. The study extends its analysis to data from Spain's La Liga and Germany's Bundesliga, yielding comparable results. Additionally, the paper assesses the impact of prior distribution choices on outcomes, concluding that the priors employed in the models provide sound results but could be refined to improve sampling efficiency, which would make more complex and extensive models feasible.
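To make the idea of a player-level xG adjustment concrete, the following is a minimal sketch of how such an adjustment could enter a logistic xG model as an offset on the log-odds scale. It is not the model fitted in the paper: the coefficient values, the function name `shot_xg`, and the example player offsets are all hypothetical and chosen only for illustration. In a hierarchical model, offsets of this kind are typically estimated jointly for all players and drawn from a shared distribution, rather than set by hand as they are here.

```python
import math

# Hypothetical coefficients for illustration only; they are NOT estimates
# from the paper.
INTERCEPT = -1.0        # baseline log-odds of scoring
BETA_DISTANCE = -0.10   # change in log-odds per metre from goal
BETA_ANGLE = 1.2        # change in log-odds per radian of shot angle


def sigmoid(z: float) -> float:
    """Map log-odds to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def shot_xg(distance_m: float, angle_rad: float, player_effect: float = 0.0) -> float:
    """Scoring probability for one shot.

    player_effect is an additive offset on the log-odds scale: positive for
    an above-average finisher, negative for a below-average one, and zero
    for the population-average player.
    """
    log_odds = (
        INTERCEPT
        + BETA_DISTANCE * distance_m
        + BETA_ANGLE * angle_rad
        + player_effect
    )
    return sigmoid(log_odds)


# The same chance (12 m out, 0.5 rad shot angle) taken by three hypothetical players.
baseline = shot_xg(12.0, 0.5)
strong_finisher = shot_xg(12.0, 0.5, player_effect=0.3)
weak_finisher = shot_xg(12.0, 0.5, player_effect=-0.3)

print(f"average player:  {baseline:.3f}")
print(f"strong finisher: {strong_finisher:.3f}")
print(f"weak finisher:   {weak_finisher:.3f}")
```

Because the offset acts on the log-odds rather than on the probability itself, the same player effect shifts xG by different absolute amounts depending on how difficult the chance is.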
