Reducing Systematic Bias in Machine Learning Applications to Signal Extraction in High-Energy Nuclear Physics

Yan Wang, Rangrong Ma, Kaifeng Shen, Zebo Tang, Wangmei Zha

TL;DR

This paper tackles systematic biases that arise when ML classifiers are trained on imperfect detector simulations for signal extraction in high-energy nuclear physics. It introduces two distribution-matching corrections, CDF mapping and shift-and-scale, to align simulated feature distributions with real data while preserving inter-feature correlations. The methods are validated with a J/$\psi$ yield analysis in 200 GeV Ru+Ru and Zr+Zr collisions from STAR, achieving a ROC AUC of about 0.91 and an optimal operating point near a BDT score of 0.7, yielding substantial gains in signal significance compared with preselection or straight-cut approaches. Self-consistency checks demonstrate that the corrections produce stable efficiency calibrations and agreement between simulated training and real data, highlighting the robustness and broad applicability of the approach to ML-based analyses in high-energy physics.

Abstract

Machine learning techniques are increasingly being applied in high-energy nuclear physics data analysis thanks to their outstanding performance. One key challenge in such applications is the construction of training samples that can accurately represent real data. Training samples are typically generated through detector simulations, but discrepancies between simulated and real data can lead to degradation in machine learning performance and systematic biases in the results. This paper introduces two methods, i) cumulative distribution function mapping and ii) shift-and-scale, to align simulated signals with real data, which can aid in eliminating the aforementioned issues. We use the J/$\psi$ yield measurement in 200 GeV Ru+Ru and Zr+Zr collisions with the STAR experiment as an example to demonstrate the application and effectiveness of the proposed methods.
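
The two corrections named in the abstract can be illustrated with a minimal NumPy sketch for a single feature. This is not the authors' implementation: the function names `shift_and_scale` and `cdf_map`, the use of the sample mean and standard deviation, and the mid-rank empirical CDF are all assumptions made for illustration. Both transforms are monotonically increasing in the feature they act on, so the rank ordering of simulated events is unchanged.

```python
import numpy as np


def shift_and_scale(sim, data):
    """Linearly transform `sim` so its mean and standard deviation match `data`.

    Hypothetical helper: assumes matching the first two moments is sufficient.
    """
    sim = np.asarray(sim, dtype=float)
    data = np.asarray(data, dtype=float)
    scale = data.std() / sim.std()
    return (sim - sim.mean()) * scale + data.mean()


def cdf_map(sim, data):
    """Map each simulated value to the data quantile at the same CDF value.

    Hypothetical helper: uses the empirical (mid-rank) CDF of `sim` and the
    empirical quantile function of `data`.
    """
    sim = np.asarray(sim, dtype=float)
    data = np.asarray(data, dtype=float)
    # Rank of each simulated value within the simulated sample (0 .. n-1).
    ranks = np.argsort(np.argsort(sim))
    # Mid-rank empirical CDF, strictly inside (0, 1).
    u = (ranks + 0.5) / sim.size
    # Evaluate the data quantile function at those CDF values.
    return np.quantile(data, u)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(0.0, 1.0, size=5000)    # toy "simulated" feature
    data = rng.normal(0.3, 1.2, size=5000)   # toy "real data" feature
    print(shift_and_scale(sim, data).mean(), data.mean())
    print(np.median(cdf_map(sim, data)), np.median(data))
```

In this sketch, shift-and-scale corrects only the location and width of the distribution, while CDF mapping also corrects its shape, at the cost of needing a sufficiently populated data sample to estimate quantiles.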

Paper Structure

This paper contains 8 sections, 5 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Distribution of $1/\beta$ versus momentum for charged particles, where the dashed horizontal lines indicate the $1/\beta$ cuts for electron preselection.
  • Figure 2: Distribution of $n\sigma_{e}$ versus momentum for charged particles after applying the $\left|1/\beta - 1 \right|<0.025$ cut. The dashed curves correspond to the $n\sigma_{e}$ cuts used for electron preselection.
  • Figure 3: Distributions of $E_{0}/p$ for electrons (circles) and protons (solid line) within $4<p_{\rm T}<6$ GeV/$c$. The vertical dashed lines correspond to the $E_{0}/p$ cuts used for electron preselection.
  • Figure 4: Invariant mass distributions of J/$\psi$ candidates reconstructed via the dielectron channel, within $|y| < 1$ and $p_{\rm T} > 0.2$ GeV/$c$, for Ru+Ru and Zr+Zr collisions. Different panels correspond to different electron identification cuts.
  • Figure 5: Correlations between selected features and the electron-positron pair invariant mass for the background training sample.
  • ...and 8 more figures