Exploring Patterns Behind Sports
Chang Liu, Chengcheng Ma, XuanQi Zhou
TL;DR
The paper tackles Olympic medal prediction by proposing a hybrid ARIMA-LSTM framework that leverages embedding and PCA for rich feature representations and efficient computation. It combines linear ARIMA components with a nonlinear LSTM to capture both trends and complex dependencies, and enhances interval prediction via KNN in a learned embedding space. Beyond forecasting, it analyzes the Great Coach Effect with runs tests and contingency analyses, and introduces investment-index metrics to guide coaching strategies, supplemented by SHAP-based attribution for traditional advantages. The results show accurate medal predictions and meaningful insights for national committees, including influential sports, host effects, gender dynamics, and strategic coaching investments, underscoring the value of integrating traditional statistics with deep learning for sports forecasting and policy guidance.
Abstract
This paper presents a comprehensive framework for time series prediction using a hybrid model that combines ARIMA and LSTM. The model incorporates two feature engineering techniques, embedding and PCA, to transform raw data into a lower-dimensional representation while retaining key information. Embeddings convert categorical data into continuous vectors, which makes it easier to capture complex relationships. PCA reduces dimensionality by extracting principal components, improving model performance and computational efficiency. To handle both linear and nonlinear patterns in the data, the ARIMA component captures linear trends, while the LSTM component captures complex nonlinear dependencies. The hybrid model is trained on historical data and achieves high accuracy, as demonstrated by low RMSE and MAE scores. Additionally, the paper employs the runs test to assess the randomness of sequences, providing insights into the underlying patterns. Ablation studies validate the role of each component and show that every module contributes to the model's performance. The paper also uses the SHAP method to quantify the impact of traditional advantages on the predicted results, offering a detailed view of feature importance. The KNN method is used to determine the optimal prediction interval, further enhancing the model's accuracy. The results highlight the effectiveness of combining traditional statistical methods with modern deep learning techniques for robust time series forecasting in sports.
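
To make the runs test mentioned above concrete, the following is a minimal sketch of the standard Wald-Wolfowitz runs test with a normal approximation. It is not taken from the paper: the function name `runs_test`, the binary encoding (for example, 1 if a team medaled at a given Games and 0 otherwise), and the sample sequences are all illustrative assumptions.

```python
import math

def runs_test(seq):
    """Wald-Wolfowitz runs test for a two-valued sequence (illustrative sketch).

    Returns (runs, z, p_value), where p_value is two-sided and based on
    the normal approximation.
    """
    if len(set(seq)) != 2:
        raise ValueError("sequence must contain exactly two distinct values")

    first = seq[0]
    n1 = sum(1 for x in seq if x == first)   # count of the first symbol
    n2 = len(seq) - n1                       # count of the other symbol
    n = n1 + n2

    # A new run starts wherever two neighbouring items differ.
    runs = 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

    # Mean and variance of the run count under the randomness hypothesis.
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if var <= 0:
        raise ValueError("sequence too short for the normal approximation")

    z = (runs - mean) / math.sqrt(var)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return runs, z, p_value

# Hypothetical data: 1 = medal won at a Games, 0 = no medal.
clustered = [1] * 10 + [0] * 10     # long streaks, very few runs
alternating = [1, 0] * 10           # strict alternation, many runs
print(runs_test(clustered))
print(runs_test(alternating))
```

A strongly negative z indicates fewer runs than expected under randomness (streaky results), while a strongly positive z indicates more runs than expected (systematic alternation). In either case a small p-value suggests the sequence is not random.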
