Reducing Systematic Bias in Machine Learning Applications to Signal Extraction in High-Energy Nuclear Physics
Yan Wang, Rangrong Ma, Kaifeng Shen, Zebo Tang, Wangmei Zha
TL;DR
This paper tackles systematic biases that arise when ML classifiers are trained on imperfect detector simulations for signal extraction in high-energy nuclear physics. It introduces two distribution-matching corrections, CDF mapping and shift-and-scale, to align simulated feature distributions with real data while preserving inter-feature correlations. The methods are validated with a J/$\psi$ yield analysis in 200 GeV Ru+Ru and Zr+Zr collisions from STAR, achieving a ROC AUC of about 0.91 and an optimal operating point near a BDT score of 0.7, with substantial gains in signal significance over preselection or straight-cut approaches. Self-consistency checks show that the corrections produce stable efficiency calibrations and agreement between the simulated training samples and real data, supporting the robustness and broad applicability of the approach to ML-based analyses in high-energy physics.
Abstract
Machine learning techniques are increasingly applied in high-energy nuclear physics data analysis because of their strong performance. A key challenge in such applications is constructing training samples that accurately represent real data. Training samples are typically generated with detector simulations, but discrepancies between simulated and real data can degrade machine learning performance and introduce systematic biases in the results. This paper introduces two methods, (i) cumulative distribution function (CDF) mapping and (ii) shift-and-scale, that align simulated signal distributions with real data and thereby mitigate both problems. We use the J/$\psi$ yield measurement in 200 GeV Ru+Ru and Zr+Zr collisions with the STAR experiment as an example to demonstrate the application and effectiveness of the proposed methods.
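
To make the two corrections concrete, the following is a minimal sketch of how a single simulated feature could be aligned with its counterpart in data. It is not the authors' implementation: the function names `shift_and_scale` and `cdf_map` are hypothetical, and the sketch assumes that `data` is a sufficiently pure signal sample of the same feature taken from real data, that shift-and-scale means matching the mean and standard deviation, and that CDF mapping means empirical quantile mapping.

```python
import numpy as np


def shift_and_scale(sim, data):
    """Linearly transform `sim` so its mean and standard deviation match `data`.

    Assumption: "shift-and-scale" is taken here to mean matching the first
    two moments; the paper may define the shift and scale differently.
    """
    sim = np.asarray(sim, dtype=float)
    data = np.asarray(data, dtype=float)
    scale = data.std() / sim.std()
    return (sim - sim.mean()) * scale + data.mean()


def cdf_map(sim, data):
    """Map each simulated value x to F_data^{-1}(F_sim(x)).

    Both CDFs are empirical: F_sim is estimated from the ranks of `sim`,
    and the inverse of F_data from the quantiles of `data`.
    """
    sim = np.asarray(sim, dtype=float)
    data = np.asarray(data, dtype=float)
    sim_sorted = np.sort(sim)
    # Empirical CDF of the simulation evaluated at every simulated value,
    # giving numbers in (0, 1].
    u = np.searchsorted(sim_sorted, sim, side="right") / sim.size
    # Inverse empirical CDF of the data at those probabilities.
    return np.quantile(data, u)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.normal(0.0, 1.0, size=5000)   # toy simulated feature
    data = rng.normal(0.5, 1.3, size=5000)  # toy "real data" feature
    for name, corrected in [("shift-and-scale", shift_and_scale(sim, data)),
                            ("CDF mapping", cdf_map(sim, data))]:
        print(f"{name}: mean={corrected.mean():.3f}, std={corrected.std():.3f}")
```

Both transformations in this sketch are monotonic and applied one feature at a time, so the ordering of simulated events within each feature is unchanged; only the marginal distribution is moved toward the data.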
