Foundation models for equation discovery in high energy physics
Manuel Morales-Alvarado
TL;DR
The paper tackles the derivation of analytic expressions for angular observables in high energy physics using LLM-SR, a symbolic-regression method built on foundation models. It represents candidate equations as Python programs, guides the search with domain-informed priors, and optimizes both their structure and their coefficients, yielding compact, interpretable parametrisations. The authors demonstrate the approach on lepton angular distributions, recovering the known law $\frac{d\sigma}{d\Omega}=\frac{\alpha^2}{4s}(1+\cos^2\theta)$, and on Drell–Yan angular coefficients, obtaining compact expressions for $A_0(p_T)$, $A_4(p_T,|y|)$, and $A_4(p_T,|y|,m)$ with competitive losses and improved extrapolation robustness attributed to the priors. This physics-informed, interpretable framework complements existing symbolic-regression tools and can accelerate phenomenological analyses and fast analytic inference across high-energy physics observables.
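To make the "equations as Python programs" idea concrete, the following is a minimal sketch of the coefficient-optimization step only, under stated assumptions. The skeleton `c0 + c1*cos^2(theta)` is written by hand here rather than proposed by a language model, the data are noiseless and generated from the known law, and the names `skeleton`, `fit_coefficients`, and the value of `S` are illustrative and not taken from the paper. Because this skeleton is linear in its coefficients, an ordinary least-squares fit stands in for whatever optimizer the authors actually use.

```python
import numpy as np

ALPHA = 1.0 / 137.035999  # fine-structure constant
S = 100.0                 # squared centre-of-mass energy (illustrative value)


def skeleton(cos_theta, params):
    """Hypothetical candidate equation: c0 + c1 * cos^2(theta)."""
    c0, c1 = params
    return c0 + c1 * cos_theta**2


def fit_coefficients(cos_theta, target):
    """Fit (c0, c1) by linear least squares; valid because the skeleton is linear in them."""
    design = np.column_stack([np.ones_like(cos_theta), cos_theta**2])
    params, *_ = np.linalg.lstsq(design, target, rcond=None)
    return params


# Synthetic, noiseless data drawn from the known law quoted in the text.
cos_theta = np.linspace(-1.0, 1.0, 201)
target = ALPHA**2 / (4.0 * S) * (1.0 + cos_theta**2)

params = fit_coefficients(cos_theta, target)
mse = float(np.mean((skeleton(cos_theta, params) - target) ** 2))

# A ratio c1/c0 close to 1 means the fit reproduces the (1 + cos^2 theta) shape.
ratio = params[1] / params[0]
print(f"c0 = {params[0]:.6e}, c1/c0 = {ratio:.6f}, mse = {mse:.3e}")
```

In a full symbolic-regression loop, a score such as `mse` would be used to rank competing skeletons; this sketch evaluates a single one.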
Abstract
Foundation models, large machine learning models trained on broad, multimodal datasets, have attracted growing attention in scientific applications due to their strong performance on diverse downstream tasks. Large Language Models (LLMs), a prominent instance of foundation models, have achieved remarkable success in tasks such as text and image generation. In this work, we investigate their potential for equation discovery in high energy physics, focusing on symbolic regression. We apply the LLM-SR methodology both to benchmark problems of equation recovery in lepton angular distributions and to the discovery of functional forms for angular coefficients in electroweak boson production at the Large Hadron Collider, observables of high phenomenological relevance for which no closed-form expressions are known from first principles. Our results demonstrate that LLM-SR can uncover compact, accurate, and interpretable equations across in-domain and out-of-domain kinematic regions, effectively incorporating embedded scientific knowledge and offering a promising new approach to equation discovery in high energy physics.
