Explainable Reinforcement Learning for Formula One Race Strategy
Devin Thomas, Junqi Jiang, Avinash Kori, Aaron Russo, Steffen Winkler, Stuart Sale, Joseph McMillan, Francesco Belardinelli, Antonio Rago
TL;DR
This work introduces Race Strategy Reinforcement Learning (RSRL), a DRQN-based framework for optimising Formula One race strategy under partial observability, evaluated on the 2023 Bahrain Grand Prix. The authors design a portable system architecture with a UnifiedRaceState/UnifiedRaceStrategy abstraction and a four-action pitstop space, which allows training on Monte Carlo race simulations and deployment with live data. RSRL outperforms the fixed and Mercedes SOTA baselines, achieving an average finishing position of P5.33 compared with P5.63 and P5.86, and generalises across tracks when trained on multiple tracks. To build trust, the approach integrates three explainable AI techniques (TimeSHAP, VIPER, and counterfactual decision trees), which provide feature-level attributions, faithful surrogate models, and actionable counterfactuals. The work also shows how increasing training diversity can improve performance on unseen tracks, facilitating practical deployment in real-world racing contexts.
Abstract
In Formula One, teams compete to develop their cars and achieve the highest possible finishing position in each race. During a race, however, teams are unable to alter the car, so they must improve their cars' finishing positions via race strategy, i.e. optimising their selection of which tyre compounds to put on the car and when to do so. In this work, we introduce a reinforcement learning model, RSRL (Race Strategy Reinforcement Learning), to control race strategies in simulations, offering a faster alternative to the industry standard of hard-coded and Monte Carlo-based race strategies. Controlling cars with a pace equating to an expected finishing position of P5.5 (where P1 represents first place and P20 is last place), RSRL achieves an average finishing position of P5.33 on our test race, the 2023 Bahrain Grand Prix, outperforming the best baseline of P5.63. We then demonstrate, in a generalisability study, how performance for one track or multiple tracks can be prioritised via training. Further, we supplement model predictions with feature importance, decision tree-based surrogate models, and decision tree counterfactuals towards improving user trust in the model. Finally, we provide illustrations which exemplify our approach in real-world situations, drawing parallels between simulations and reality.
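To make the headline metric concrete, the following is a minimal sketch of how a hard-coded pitstop baseline and an average finishing position over simulated races might be expressed. It is not the authors' implementation: the `PitAction` names (one reading of a four-action pitstop space as "stay out" plus three dry compounds), the two-stop plan, the 57-lap race length, and the example finishing positions are all assumptions chosen purely for illustration.

```python
from enum import Enum
from statistics import mean


class PitAction(Enum):
    # Assumed four-action space: stay out, or pit for one of three dry compounds.
    STAY_OUT = 0
    PIT_SOFT = 1
    PIT_MEDIUM = 2
    PIT_HARD = 3


def fixed_strategy(lap: int, pit_laps: dict[int, PitAction]) -> PitAction:
    """Hard-coded baseline: pit on pre-planned laps, otherwise stay out."""
    return pit_laps.get(lap, PitAction.STAY_OUT)


def average_finishing_position(positions: list[int]) -> float:
    """Mean finishing position over simulated races (1 = first, 20 = last)."""
    if not all(1 <= p <= 20 for p in positions):
        raise ValueError("finishing positions must lie in 1..20")
    return mean(positions)


# Hypothetical two-stop plan over an assumed 57-lap race.
RACE_LAPS = 57
plan = {15: PitAction.PIT_HARD, 38: PitAction.PIT_SOFT}
actions = [fixed_strategy(lap, plan) for lap in range(1, RACE_LAPS + 1)]

# Made-up finishing positions from six simulated races, for illustration only.
example_positions = [5, 6, 4, 7, 5, 6]
print(average_finishing_position(example_positions))
```

Under this framing, a learned policy would replace `fixed_strategy` by choosing a `PitAction` each lap from the observed race state, and the two approaches would be compared by their average finishing position over many simulated races.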
