Predicting soccer matches with complex networks and machine learning

Eduardo Alves Baratela; Felipe Jordão Xavier; Thomas Peron; Paulino Ribeiro Villas-Boas; Francisco Aparecido Rodrigues

TL;DR

This study highlights the use of complex networks as an alternative tool for predicting soccer match outcomes and shows how the combination of structural analysis of passing networks with match statistical data can provide deeper insights into the game patterns and strategies used by teams.

Abstract

Soccer attracts the attention of many researchers and professionals in the sports industry, and the incorporation of science into the sport is growing steadily, with increasing investment in performance analysis and sports prediction. This study aims to (i) highlight the use of complex networks as an alternative tool for predicting soccer match outcomes, and (ii) show how the combination of structural analysis of passing networks with match statistical data can provide deeper insights into the game patterns and strategies used by teams. To do so, complex network metrics and match statistics were used to build machine learning models that predict the wins and losses of soccer teams in different leagues. The results showed that models based on passing networks were as effective as "traditional" models, which use general match statistics. Moreover, combining both approaches produced more accurate models than using either one separately, demonstrating that their fusion can offer a deeper understanding of game patterns and of the tactics employed by teams, the relationships between players, their positions, and their interactions during matches. Both network metrics and match statistics were important for the mixed model. Furthermore, networks built at a finer temporal granularity (such as one network for each half of the match) performed better than a single network for the entire game.
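The idea of building one passing network per half of a match can be illustrated with a minimal sketch. It assumes a hypothetical event format of `(half, passer, receiver)` tuples and uses toy data invented for the example; the function names are illustrative and the authors' actual pipeline is not shown in this summary. Each network is directed and weighted by pass count, and the node metric computed is the weighted in-degree plus out-degree (weighting by pass count is an assumption of this sketch).

```python
from collections import defaultdict


def build_passing_networks(passes):
    """Group pass events into one directed, weighted network per half.

    `passes` is an iterable of (half, passer, receiver) tuples. This event
    format is hypothetical and chosen only for the example.
    Returns {half: {(passer, receiver): pass_count}}.
    """
    networks = defaultdict(lambda: defaultdict(int))
    for half, passer, receiver in passes:
        networks[half][(passer, receiver)] += 1
    return {half: dict(edges) for half, edges in networks.items()}


def total_degree(edges):
    """Weighted in-degree plus out-degree for every player in one network."""
    degree = defaultdict(int)
    for (passer, receiver), count in edges.items():
        degree[passer] += count    # passes made by this player
        degree[receiver] += count  # passes received by this player
    return dict(degree)


# Toy pass events, invented purely for illustration.
toy_passes = [
    (1, "A", "B"), (1, "B", "C"), (1, "A", "B"), (1, "C", "A"),
    (2, "A", "C"), (2, "C", "B"),
]

networks = build_passing_networks(toy_passes)
degrees_by_half = {half: total_degree(edges) for half, edges in networks.items()}
```

Keeping the halves as separate dictionaries means any per-network metric can be computed for each half independently and then used as a separate feature, which is one plausible way to feed such metrics to a classifier.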

Paper Structure (17 sections, 6 equations, 8 figures, 3 tables)

This paper contains 17 sections, 6 equations, 8 figures, 3 tables.

Introduction
Methodology
ETL / Data Collecting
Construction of passing networks
Measures for network characterization
Analysis of team performance
Machine learning methods
Results
Tournament Clustering
Complex network results
Comparing predictive models with and without complex network metrics
Combining approaches
Feature Importance
Tournament-based analysis
Conclusion
...and 2 more sections

Figures (8)

Figure 1: Example of a match between Barcelona and Real Madrid, for which four player passing networks were constructed. The upper part shows the networks of Real Madrid for the first and second half (left and right, respectively). The lower part shows the networks of Barcelona for the first and second half, respectively. The larger the area of a node's circle, the higher the degree (sum of in-degree and out-degree) of that node, indicating more involvement in passes by that player. The networks differ from team to team in their structure and in their reliance on certain players, indicating changes in team characteristics and strategies throughout the game.
Figure 2: Metrics ($y$-axis) in relation to the rankings of Premier League teams ($x$-axis). The values "r" and "p" are the Pearson correlation coefficient between each metric and the teams' rankings in the championship and the corresponding p-value.
Figure 3: The Elbow and Silhouette methods with a value of $k=3$. Figures (a) and (b) correspond to a dataset without variable selection, while Figures (c) and (d) correspond to the dataset with the use of PCA.
Figure 4: This graph shows the explained variability of each principal component obtained by applying Principal Component Analysis (PCA). The explained variability is expressed as a percentage, representing the amount of data variation explained by each component. The bars indicate the explained variability of each individual component, while the blue line represents the cumulative variability, i.e., the cumulative sum of the explained variability up to the current component.
Figure 5: Graphs showing the performance of the different models: the Precision-Recall curve on the left and the ROC curve on the right.
...and 3 more figures
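Two quantities named in the captions above, the Pearson correlation in Figure 2 and the cumulative explained variability in Figure 4, can be sketched in a few lines. This is a minimal illustration on invented numbers, not the authors' code; the function names are hypothetical, and the p-value reported alongside "r" in Figure 2 is not computed here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def cumulative_explained_variance(component_variances):
    """Return (per-component ratios, running cumulative sums of the ratios)."""
    total = sum(component_variances)
    ratios = [v / total for v in component_variances]
    cumulative = []
    running = 0.0
    for ratio in ratios:
        running += ratio
        cumulative.append(running)
    return ratios, cumulative


# Invented example: a metric that falls linearly as the league rank worsens.
ranks = [1, 2, 3, 4, 5]
metric = [10.0, 8.0, 6.0, 4.0, 2.0]
r = pearson_r(ranks, metric)

# Invented component variances, sorted from largest to smallest.
ratios, cumulative = cumulative_explained_variance([4.0, 3.0, 2.0, 1.0])
```

A negative "r" against rank means the metric is higher for better-placed teams, since rank 1 is the best position. The cumulative list corresponds to the line in a plot like Figure 4, and the per-component ratios correspond to the bars.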
