Inferring Interpretable Models of Fragmentation Functions using Symbolic Regression

Nour Makke, Sanjay Chawla

TL;DR

This work tackles the challenge of obtaining interpretable fragmentation-function (FF) forms directly from experimental data using symbolic regression (SR). By applying a transformer-based SR model to COMPASS SIDIS multiplicities, the authors extract analytic FF-like expressions, with the top univariate form $f_{\text{SR}}(z)= a(1-z)^{c}\,\exp(-b z)$ closely resembling the Lund FF and describing data across species and phase space. The study demonstrates that SR can recover meaningful, human-interpretable functions from noisy measurements, performing well in univariate and limited bivariate contexts while revealing limitations in universal multi-dimensional parameterizations. These results suggest SR-derived FF forms could serve as data-driven parameterizations in global QCD fits, offering a pathway toward interpretable, physics-grounded machine learning in high-energy phenomenology.
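To make the univariate form quoted above concrete, the following minimal sketch evaluates $a(1-z)^{c}\exp(-bz)$ at a few values of $z$. The function name `f_sr` and the parameter values are assumptions chosen purely for illustration; they are not fitted values from the paper.

```python
import math


def f_sr(z: float, a: float, b: float, c: float) -> float:
    """Evaluate a * (1 - z)**c * exp(-b * z) for 0 <= z <= 1."""
    return a * (1.0 - z) ** c * math.exp(-b * z)


# Hypothetical parameters, for illustration only (not taken from the paper).
a, b, c = 1.0, 2.0, 1.5

for z in (0.2, 0.4, 0.6, 0.8):
    print(f"z = {z:.1f}  f_SR = {f_sr(z, a, b, c):.4f}")
```

With positive `b` and `c`, both the $(1-z)^{c}$ factor and the exponential fall with increasing $z$, so the function decreases monotonically and vanishes at $z=1$.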

Abstract

Machine learning is rapidly making its way into the natural sciences, including high-energy physics. We present the first study that infers, directly from experimental data, a functional form of fragmentation functions. The latter are a key ingredient for describing physical observables measured in high-energy physics processes that involve hadron production, and for predicting their values at different energies. Fragmentation functions cannot be calculated from theory and must instead be determined from data. Traditional approaches rely on global fits of experimental data using a pre-assumed functional form, inspired by phenomenological models, to learn its parameters. This novel approach uses an ML technique, namely symbolic regression, to learn an analytical model from measured charged-hadron multiplicities. The function learned by symbolic regression resembles the Lund string function and describes the data well, thus representing a potential candidate for use in global FF fits. This study represents an approach to follow in such QCD-related phenomenology studies and, more generally, in the sciences.

Paper Structure

This paper contains 11 sections, 14 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: Expression-tree structure of the equation $f(x)=x/2-x^2$. Internal (and root) nodes are drawn with dashed lines and terminal nodes with solid lines. Edges connect operators to their respective operand(s).
  • Figure 2: "(Data-SR)/SR" comparison for $h^{+}$ and $h^{-}$ multiplicity values from COMPASS:2016xvm. SR here refers to equations independently learned in individual ($x,y$) bins. The bins where a set of points is missing refer to cases for which normalization factors are missing, e.g., $f^{h^{+}}_{\text{SR}}=(\cos(2z))^3/z$ for $0.03<x<0.04$ and $f^{h^{+}}_{\text{SR}}=(\cos(2z))^5/z$ for $0.18<x<0.4$ for $0.15<y<0.2$, $f^{h^{+}}_{\text{SR}}=\exp(-z)/z^2$ for $0.18<x<0.4$ and $0.2<y<0.3$, and $f^{h^{-}}_{\text{SR}}=\exp(-az)/z^2$ for $0.1<x<0.14$ and $0.3<y<0.5$.
  • Figure 3: Comparison of the fits of $h^{+}$ multiplicities COMPASS:2016xvm using the functions in Eq. \ref{eq:srmodels}, where $\chi^2_i$ denotes the $\chi^2/\mathrm{ndf}$ values obtained using $g_i(z)$. The box delimits the first and third quartiles, and the middle line represents the median. The bottom and top lines represent, respectively, the minimum and maximum of the $\chi^2/\mathrm{ndf}$ values. Markers show the outliers (values significantly smaller or larger than the median).
  • Figure 4: Comparison between experimental data COMPASS:2016xvm and fits performed using the SR model $g_4(z)$ (Eq. \ref{eq:srmodels}) for positive hadron multiplicities, displayed as a function of $z$ in nine $x$ bins and five $y$ bins (staggered vertically by $\delta=0.3$ for clarity). Statistical uncertainties are considered in the fits, and $\chi^2/\text{ndf}$ values are summarized in Tab. \ref{tab:chi2_pi_kp} (top-left).
  • Figure 5: "(Data-fit)/fit" for the fit to the $h^{\pm}$ SIDIS multiplicities from COMPASS:2016xvm using $g_4(z)$.
  • ...and 5 more figures
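
The expression-tree representation named in the Figure 1 caption can be sketched in a few lines. This is an illustrative sketch only: the `Node` class, the `evaluate` helper, and the choice to encode $x^2$ as a power node are assumptions, and the tree drawn in the paper's figure may be laid out differently.

```python
import operator
from dataclasses import dataclass
from typing import Optional

# Binary operators allowed at internal nodes (an illustrative choice).
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "^": operator.pow,
}


@dataclass
class Node:
    label: str                      # operator symbol, "x", or a numeric constant
    left: Optional["Node"] = None   # first operand (None for terminal nodes)
    right: Optional["Node"] = None  # second operand (None for terminal nodes)


def evaluate(node: Node, x: float) -> float:
    """Recursively evaluate the tree at the point x."""
    if node.label in OPS:
        return OPS[node.label](evaluate(node.left, x), evaluate(node.right, x))
    if node.label == "x":
        return x
    return float(node.label)


# f(x) = x/2 - x^2: the root is "-", with the subtrees "x / 2" and "x ^ 2".
tree = Node(
    "-",
    Node("/", Node("x"), Node("2")),
    Node("^", Node("x"), Node("2")),
)

print(evaluate(tree, 2.0))
```

Operators sit at the root and internal nodes, while the variable `x` and the constants are terminal nodes, which matches the distinction the caption draws.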