Toward Model-Agnostic Detection of New Physics Using Data-Driven Signal Regions

Soheun Yi, John Alison, Mikael Kuusela

TL;DR

The paper tackles model-agnostic detection of new physics by exploiting the localization of potential signals in a high-dimensional feature space and proposing a data-driven SR selection strategy based on density-ratio analysis. It learns a compact event representation $\zeta(x)$ and constructs a density-ratio framework with $\gamma(\zeta) = p_{4b}(\zeta)/p_{3b}(\zeta)$ and its smoothed counterpart $\widetilde{\gamma}(\zeta) = (p_{4b} * K)(\zeta)/(p_{3b} * K)(\zeta)$, enabling SR definition via the ratio $\gamma(\zeta)/\widetilde{\gamma}(\zeta)$. To estimate these quantities without direct density estimation, it trains a classifier on augmented, noisy representations $(Z_{3b}+\mathcal{E},0)$ vs $(Z_{4b}+\mathcal{E},1)$ with $\mathcal{E} \sim K$. On simulated $\mathrm{HH} \rightarrow 4b$ data, the method yields SRs enriched in signal and is competitive with a domain-knowledge baseline, particularly when SRs must be small, highlighting its practical value for model-agnostic searches of new physics.

Abstract

In the search for new particles in high-energy physics, it is crucial to select the Signal Region (SR) in such a way that it is enriched with signal events if they are present. While most existing search methods set the region relying on prior domain knowledge, it may be unavailable for a completely novel particle that falls outside the current scope of understanding. We address this issue by proposing a method built upon a model-agnostic but often realistic assumption about the localized topology of the signal events, in which they are concentrated in a certain area of the feature space. Considering the signal component as a localized high-frequency feature, our approach employs the notion of a low-pass filter. We define the SR as an area which is most affected when the observed events are smeared with additive random noise. We overcome challenges in density estimation in the high-dimensional feature space by learning the density ratio of events that potentially include a signal to the complementary observation of events that closely resemble the target events but are free of any signals. By applying our method to simulated $\mathrm{HH} \rightarrow 4b$ events, we demonstrate that the method can efficiently identify a data-driven SR in a high-dimensional feature space in which a high portion of signal events concentrate.
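
The classifier-based estimate of the smoothed density ratio mentioned in the TL;DR can be illustrated with a minimal one-dimensional sketch. Everything below is an assumption made for illustration and is not the paper's implementation: the representation is a scalar, the samples are background-only Gaussians (so the true log-ratio is exactly linear), the kernel $K$ is ${\mathcal{N}}(0, 2^2)$, and the classifier is a plain logistic regression fitted by Newton's method in NumPy. With equal class sizes, the classifier odds $h/(1-h)$ estimate $\widetilde{\gamma}$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20000          # events per class (assumed; equal sizes keep the prior ratio at 1)
kernel_std = 2.0   # scale of the smoothing kernel K (assumed)

# Toy representations: background-only stand-ins for the 3b and 4b samples.
z3b = rng.normal(1.0, 4.0, n)
z4b = rng.normal(-1.0, 4.0, n)

# Augment with additive noise E ~ K and label 3b as 0, 4b as 1.
x = np.concatenate([z3b + rng.normal(0.0, kernel_std, n),
                    z4b + rng.normal(0.0, kernel_std, n)])
y = np.concatenate([np.zeros(n), np.ones(n)])

# Logistic regression with features [1, z], fitted by Newton's method.
X = np.column_stack([np.ones_like(x), x])
w = np.zeros(2)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-X @ w))
    grad = X.T @ (p - y)
    hess = (X * (p * (1.0 - p))[:, None]).T @ X
    w -= np.linalg.solve(hess, grad)

def gamma_tilde_hat(z):
    """Estimated smoothed ratio: classifier odds h / (1 - h) = exp(logit)."""
    return np.exp(w[0] + w[1] * z)

# Analytic reference for this toy: log gamma_tilde(z) = -z / 10.
print("fitted intercept and slope:", w)
print("estimate vs. analytic at z = 5:", gamma_tilde_hat(5.0), np.exp(-0.5))
```

In this toy the smoothed densities are ${\mathcal{N}}(1, 20)$ and ${\mathcal{N}}(-1, 20)$, so the fitted slope should land near $-0.1$ and the intercept near $0$. Training the same kind of classifier on the un-noised samples would estimate $\gamma$ instead, and the ratio of the two estimates gives the SR score.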

Paper Structure (5 sections, 4 equations, 4 figures)

Figures (4)

  • Figure 1: Depiction of the variables used to represent a particle jet. $p_T$ is the transverse momentum, $\phi$ is the azimuthal angle, and $\theta$ is the polar angle. The pseudorapidity is given by $\eta = -\log(\tan(\theta/2))$.
  • Figure 2: $P_{3b} = {\mathcal{N}}(1, 4^2)$, $P_{4b} = 0.95 B_{4b} + 0.05 S = 0.95 {\mathcal{N}}(-1, 4^2) + 0.05 {\mathcal{N}}(7, 0.5^2)$, and $K = {\mathcal{N}}(0, 2^2)$. Signal events (with distribution $S$) are concentrated around $\zeta = 7$ (dashed lines). Looking at the largest values of $\gamma$ (upper right) does not identify the center of signal events, but the ratio $\gamma / \widetilde{\gamma}$ (lower right) does.
  • Figure 3: Concentration of signal events in the SR for the $\mathrm{HH} \to 4b$ data. The $x$-axis represents the proportion of $4b$ events in the SR (measured by $P_{4b}({\mathcal{X}}_s)$) and the $y$-axis represents the proportion of signal events in the SR (measured by $S({\mathcal{X}}_s)$). As defined previously, $n$, $\epsilon$, and $\eta$ are the number of $3b$ events, the signal ratio, and the scale of the convolution kernel, respectively (here $\eta$ is not the pseudorapidity of Figure 1). Bold lines and shaded areas represent the mean $\pm$ one standard deviation over $10$ repeated experiments.
  • Figure 4: Comparison with the baseline SR provided by Bryant2018_search. Bold lines represent the average over $10$ repeated experiments, and the shaded region represents the standard deviation of our method (the baseline's variance is negligible and is therefore omitted). While the baseline, which has access to a priori knowledge of the location of the signal, performs better when $P_{4b}({\mathcal{X}}_s)$ is large, the two methods perform comparably when $P_{4b}({\mathcal{X}}_s)$ is small.
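
The one-dimensional example in the Figure 2 caption can be reproduced analytically, since convolving a Gaussian with the Gaussian kernel $K$ only adds the variances. The sketch below uses the distributions stated in that caption; the evaluation grid $[-15, 15]$ and the helper names are assumptions chosen for illustration, and the location of the largest $\gamma$ depends on that grid because $\gamma$ keeps growing in the left tail.

```python
import numpy as np

KERNEL_STD = 2.0  # K = N(0, 2^2), from the Figure 2 caption

def normal_pdf(z, mean, std):
    """Density of N(mean, std^2) evaluated at z."""
    return np.exp(-0.5 * ((z - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def p3b(z, smooth=False):
    """P_3b = N(1, 4^2); smoothing with K adds the kernel variance."""
    extra = KERNEL_STD**2 if smooth else 0.0
    return normal_pdf(z, 1.0, np.sqrt(4.0**2 + extra))

def p4b(z, smooth=False):
    """P_4b = 0.95 N(-1, 4^2) + 0.05 N(7, 0.5^2), optionally smoothed with K."""
    extra = KERNEL_STD**2 if smooth else 0.0
    background = normal_pdf(z, -1.0, np.sqrt(4.0**2 + extra))
    signal = normal_pdf(z, 7.0, np.sqrt(0.5**2 + extra))
    return 0.95 * background + 0.05 * signal

grid = np.linspace(-15.0, 15.0, 3001)  # evaluation grid (assumed range)

gamma = p4b(grid) / p3b(grid)                                  # raw density ratio
gamma_tilde = p4b(grid, smooth=True) / p3b(grid, smooth=True)  # smoothed ratio
score = gamma / gamma_tilde                                    # SR score

z_gamma = grid[np.argmax(gamma)]
z_score = grid[np.argmax(score)]
print("argmax of gamma:", z_gamma)
print("argmax of gamma / gamma_tilde:", z_score)
```

On this grid the largest value of $\gamma$ sits at the left edge, far from the signal, because the background-only ratio behaves like $e^{-\zeta/8}$. The score $\gamma / \widetilde{\gamma}$ instead peaks close to $\zeta = 7$, in line with the caption.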