Robust semi-parametric signal detection in particle physics with classifiers decorrelated via optimal transport

Purvasha Chakravarti, Lucas Kania, Olaf Behnke, Mikael Kuusela, Larry Wasserman

TL;DR

The paper tackles robust signal detection in particle physics when background misspecification biases supervised classifiers. It introduces a three-step pipeline: decorrelate the classifier output from the protected variable (the invariant mass) via an optimal transport map, cut on the transformed classifier to enrich the sample in signal, and run a semiparametric test on the protected variable. The authors derive efficient semiparametric estimators of the signal strength under known or parametric backgrounds and validate the approach with W-tagging and high-mass resonance experiments, showing reduced sculpting and improved power. The work demonstrates that post-processing decorrelation (CDOT) yields stable, robust, and more powerful tests, with practical implications for collider analyses and potential extensions to multiple protected variables.

Abstract

Searches for signals of new physics in particle physics are usually done by training a supervised classifier to separate a signal model from the known Standard Model physics (also called the background model). However, even when the signal model is correct, systematic errors in the background model can influence supervised classifiers and might adversely affect the signal detection procedure. To tackle this problem, one approach is to use the (possibly misspecified) classifier only to perform a preliminary signal-enrichment step and then to carry out a signal detection test on the signal-rich sample. For this procedure to work, we need a classifier constrained to be decorrelated with one or more protected variables used for the signal-detection step. We do this by considering an optimal transport map of the classifier output that makes it independent of the protected variable(s) for the background. We then fit a semiparametric mixture model to the distribution of the protected variable after making cuts on the transformed classifier to detect the presence of a signal. We compare and contrast this decorrelation method with previous approaches, show that the decorrelation procedure is robust to moderate background misspecification, and analyze the power and validity of the signal detection test as a function of the cut on the classifier both with and without decorrelation. We conclude that decorrelation and signal enrichment help produce a stable, robust, valid, and more powerful test.
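
The decorrelation step described in the abstract can be illustrated with a small sketch. For a one-dimensional score, the optimal transport map onto a uniform target is the cumulative distribution function, so mapping the classifier output $h$ to its conditional CDF given the mass makes the transformed score approximately uniform, and hence independent of the mass, on background. The code below is only an illustration under stated assumptions and is not the authors' implementation: the conditional CDF is estimated crudely by binning the mass into quantile bins, the toy background generator is invented for the example, and all function names (`fit_conditional_cdf`, `decorrelate`, `pass_fraction_per_bin`, `toy_background`) are hypothetical.

```python
import numpy as np

def mass_bin_index(m, edges):
    # Use interior edges only, so every event maps to a bin in 0..n_bins-1
    # (masses outside the fitted range fall into the first or last bin).
    return np.searchsorted(edges[1:-1], m, side="right")

def fit_conditional_cdf(h_bkg, m_bkg, n_bins=10):
    """Store the sorted background scores in each quantile bin of the mass."""
    edges = np.quantile(m_bkg, np.linspace(0.0, 1.0, n_bins + 1))
    idx = mass_bin_index(m_bkg, edges)
    sorted_scores = [np.sort(h_bkg[idx == b]) for b in range(n_bins)]
    return edges, sorted_scores

def decorrelate(h, m, edges, sorted_scores):
    """Map each score to its empirical background CDF within its mass bin."""
    idx = mass_bin_index(m, edges)
    t = np.empty(len(h), dtype=float)
    for b, s in enumerate(sorted_scores):
        mask = idx == b
        t[mask] = np.searchsorted(s, h[mask], side="right") / len(s)
    return t

def pass_fraction_per_bin(score, m, edges, cut):
    """Fraction of events with score > cut in each mass bin."""
    idx = mass_bin_index(m, edges)
    return np.array([np.mean(score[idx == b] > cut)
                     for b in range(len(edges) - 1)])

def toy_background(rng, n):
    # Assumed toy model: falling mass spectrum, score correlated with mass.
    m = 50.0 + rng.exponential(scale=50.0, size=n)
    s = (m - 100.0) / 30.0 + rng.normal(size=n)
    h = 1.0 / (1.0 + np.exp(-s))
    return h, m

rng = np.random.default_rng(0)
h_fit, m_fit = toy_background(rng, 20_000)   # sample used to fit the map
h_new, m_new = toy_background(rng, 20_000)   # independent sample to check it

edges, sorted_scores = fit_conditional_cdf(h_fit, m_fit)
t_new = decorrelate(h_new, m_new, edges, sorted_scores)

cut = 0.7
raw_eff = pass_fraction_per_bin(h_new, m_new, edges, cut)
dec_eff = pass_fraction_per_bin(t_new, m_new, edges, cut)
print("raw cut, pass fraction per mass bin:", np.round(raw_eff, 2))
print("decorrelated cut, pass fraction per mass bin:", np.round(dec_eff, 2))
```

In this toy setup a cut on the raw score keeps very different fractions of background in different mass bins, which is the sculpting effect, while the same cut on the transformed score keeps roughly the same fraction everywhere. Binning leaves some residual dependence inside each bin; a smoother conditional CDF estimate would reduce it.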

Paper Structure (27 sections, 9 theorems, 97 equations, 27 figures)

Key Result

Lemma 1

Under the model (eq:model) with known background and $\lambda \in (0,1)$, the plug-in estimator [...] is efficient. Furthermore, the test [...] is an asymptotically valid test at level $\alpha$ for $\lambda \in [0,1)$ and $B(\mathbb{C})>0$.

Figures (27)

  • Figure 1: Pictorial representation of signal detection in a mass spectrum.
  • Figure 2: Evidence of sculpting in the shape of the background distribution.
  • Figure 3: Flowchart of the signal detection pipeline.
  • Figure 4: Density plots of the invariant mass for the W-tagging data for different ranges of the classifier output ($h$) without any decorrelation.
  • Figure 5: Post-decorrelation density plots (top) and histograms (bottom) of the invariant mass for the W-tagging data, for different ranges of the transformed classifier ($T_M(h)$).
  • ...and 22 more figures

Theorems & Definitions (19)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Definition 1: Differentiable path or sub-model
  • Proposition 1
  • Definition 2: Tangent set
  • Definition 3: Differentiable functional
  • Definition 4: Efficient influence function
  • Lemma 4: Variance of efficient influence function lower bounds squared risk
  • Definition 5: Regular estimator
  • ...and 9 more