Detecting Localized Density Anomalies in Multivariate Data via Coin-Flip Statistics

Sebastian Springer, Andre Scaffidi, Maximilian Autenrieth, Gabriella Contardo, Alessandro Laio, Roberto Trotta, Heikki Haario

TL;DR

EagleEye tackles the problem of detecting localized density differences between two multivariate datasets via a simple, fully unsupervised two-sample framework. It casts local neighbourhood composition as a coin-flip process, computing a per-point anomaly score $\Upsilon_i = \max_{K \le K_M} -\log \mathrm{pval}(B_{\mathrm{obs}}(i,K))$ and flagging potential anomalies, which are then refined by Iterative Density Equalization (IDE) and multimodal repêchage to isolate genuine local overdensities. An injection-based procedure estimates irreducible background and yields an unsupervised estimate of the local signal purity $\widehat{\frac{S_{\alpha}}{S_{\alpha}+B_{\alpha}}}$, enabling a global assessment via the aggregated anomaly sets. Demonstrations on synthetic Gaussian-density anomalies, LHC resonance detection, and climate-temperature data show EagleEye’s ability to detect tiny localized signals (e.g., 0.3% at the LHC) while maintaining scalability and deterministic, parallelizable computation. The method is broadly applicable to fields ranging from high-energy physics to climate science, offering per-region localization, robust performance in high dimensions, and explicit estimates of background and signal composition without reliance on kernel-based models or pre-specified signal regions.

Abstract

Detecting localized density differences in multivariate data is a crucial task in computational science. Such anomalies can indicate a critical system failure, lead to a groundbreaking scientific discovery, or reveal unexpected changes in data distribution. We introduce EagleEye, an anomaly detection method to compare two multivariate datasets with the aim of identifying local density anomalies, namely over- or under-densities affecting only localised regions of the feature space. Anomalies are detected by modelling, for each point, the ordered sequence of its neighbours' membership label as a coin-flipping process and monitoring deviations from the expected behaviour of such process. A unique advantage of our method is its ability to provide an accurate, entirely unsupervised estimate of the local signal purity. We demonstrate its effectiveness through experiments on both synthetic and real-world datasets. In synthetic data, EagleEye accurately detects anomalies in multiple dimensions even when they affect a tiny fraction of the data. When applied to a challenging resonant anomaly detection benchmark task in simulated Large Hadron Collider data, EagleEye successfully identifies particle decay events present in just 0.3% of the dataset. In global temperature data, EagleEye uncovers previously unidentified, geographically localised changes in temperature fields that occurred in the most recent years. Thanks to its key advantages of conceptual simplicity, computational efficiency, trivial parallelisation, and scalability, EagleEye is widely applicable across many fields.
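
As an illustration of the coin-flip idea, the per-point score $\Upsilon_i = \max_{K \le K_M} -\log \mathrm{pval}(B_{\mathrm{obs}}(i,K))$ from the TL;DR can be sketched in a few lines of standard-library Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the p-value is taken to be the one-sided upper tail of a binomial distribution, the coin bias is estimated as the fraction of test points in the pooled sample, the logarithm is natural, neighbours are found by brute-force Euclidean distance, and the function names `binom_upper_tail` and `anomaly_scores` are hypothetical.

```python
import math
import random


def binom_upper_tail(b, k, p):
    """P(B >= b) for B ~ Binomial(k, p): the assumed one-sided p-value."""
    return sum(math.comb(k, j) * p**j * (1 - p) ** (k - j) for j in range(b, k + 1))


def anomaly_scores(test, ref, k_max):
    """Score each test point by max over K <= k_max of -log pval(B_obs(i, K)).

    B_obs(i, K) counts test-set members among the K nearest neighbours of
    point i in the pooled (test + reference) sample.
    """
    pooled = [(pt, 1) for pt in test] + [(pt, 0) for pt in ref]
    p = len(test) / len(pooled)  # assumed coin bias under the null
    scores = []
    for i, x in enumerate(test):
        # Labels of all other points, ordered by distance to x (self excluded).
        others = sorted(
            (math.dist(x, y), label)
            for j, (y, label) in enumerate(pooled)
            if j != i
        )
        best, b_obs = 0.0, 0
        for k in range(1, k_max + 1):
            b_obs += others[k - 1][1]  # running count of test-set neighbours
            pval = min(binom_upper_tail(b_obs, k, p), 1.0)
            best = max(best, -math.log(pval))
        scores.append(best)
    return scores


# Toy usage: uniform background plus a small, tight overdensity in the test set.
random.seed(0)
ref_pts = [(random.random(), random.random()) for _ in range(200)]
background = [(random.random(), random.random()) for _ in range(200)]
cluster = [(random.gauss(0.5, 0.01), random.gauss(0.5, 0.01)) for _ in range(30)]
test_pts = background + cluster
scores = anomaly_scores(test_pts, ref_pts, k_max=20)
print("mean score, background:", sum(scores[:200]) / 200)
print("mean score, cluster:   ", sum(scores[200:]) / 30)
```

In this toy setting the clustered points see neighbourhoods dominated by test-set members, so their scores sit well above those of the background points. The later stages named in the TL;DR (IDE, repêchage, injection) are not covered by this sketch.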

Paper Structure

This paper contains 13 sections, 23 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Sketch illustrating the conceptual steps in the EagleEye density anomaly detection methodology detailed in the text in Sec. \ref{sec:flagging} (sub-panel a.), Sec. \ref{sec:pruning} (sub-panel b.), Sec. \ref{sec:repechage} (sub-panel c.) and Sec. \ref{sec:injection} (sub-panel d.). In each sub-panel, the plot on the left shows the anomaly score (vertical axis) against the feature-space location of the samples. For this cartoon, the anomaly is a top-hat overdensity (grey density in the top right panel).
  • Figure 2: EagleEye detection of density anomalies within a uniform background. (A): distribution of anomalies in feature space, showing overdensities (orange) and underdensities (violet). (B): Points flagged as anomalous in the test set (warm orange shades) and in the reference set (cool violet shades). (C) and (D): Local anomalies after Iterative Density Equalization (dark green) and multimodal repêchage (light green). Overdensities are shown on the left, underdensities on the right. (E) and (F): Distribution of the anomaly score $\mathbf{\Upsilon}_i$ for sequences of Bernoulli trials (black, null distribution), the test set (gray), the pruned set (dark green), the clustered anomaly set (light green), and the equalized set (blue, matching the null distribution).
  • Figure 3: Resonant anomaly detection on the LHC Olympics R&D dataset using EagleEye . This figure demonstrates the application of EagleEye for resonant anomaly detection using the LHC Olympics R&D dataset. Each column corresponds to a different fraction of overdensity anomaly. The plots in the first two rows display 2D slices of the 8-dimensional feature space (in $m_{1}$ and $m_{2}$, normalised units) that best illustrate the multi-modal distribution of the signal events (anomalous data). A–C Histograms of the total reference (background) counts, with signal events overlaid as gold scatter points. D–F The sets $\hat{\mathcal{Y}}^+$ (dark green) and $\bigcup_\alpha{\mathcal{Y}}^\text{anom}_\alpha$ (light green) obtained using an initial threshold $\mathbf{\Upsilon}_i \geq \mathbf{\Upsilon}^*_+ = 14$ corresponding to a quantile of approximately $4\sigma$ of the background-only $\mathbf{\Upsilon}_i$ distribution with $K_M = 1000$. G–I Histograms of the $\mathbf{\Upsilon}_i$ distributions on a logarithmic scale for: sequences of Bernoulli trials (black), the test set (gray), the pruned set $\hat{\mathcal{Y}}^+$ (dark green), the repêchage set $\mathcal{Y}_\alpha^{\mathrm{anom}}$ (light green), and the equalized set $\mathcal{Y}^\text{eq}$ (blue). The initial critical threshold $\mathbf{\Upsilon}_+^*=14$ is indicated by a red dashed line.
  • Figure 4: Analysis of Air2m anomalies for the DJF and JJA seasons. This figure demonstrates the application of EagleEye to detect shifts in temperature patterns over the past seventy years using global daily average air temperature fields measured at 2 m above sea level (Air2m). The analysis focuses on the Northern Hemisphere, with separate examinations for winter (DJF) and summer (JJA) seasons. Panels A–D: The number of anomalous days identified in the pruned set $\hat{\mathcal{Y}}^+$ is shown for a moving $60^\circ$ longitudinal window. Each of the three time periods comprises 2130 days. Negative values indicate $\mathcal{X}$-overdensities (i.e., regions where anomalies correspond to an overdensity in the reference set), which aids in readability; panels A and C correspond to JJA, while panels B and D correspond to DJF. A sensitivity analysis, performed by shifting the reference period, confirmed the robustness of these results (only the average is shown, as the variance was negligible). Panels E–H: These panels display the average de-trended and de-seasonalized Air2m anomalies for the day with the largest $\mathbf{\Upsilon}_i$ and its ten nearest neighbors from the repêchage set $\mathcal{Y}^+_{\alpha}$. For DJF, the central longitudes are $30^\circ$W and $20^\circ$W (panels G and H), while for JJA they are $180^\circ$ and $30^\circ$W (panels E and F). Green boxes in each subplot delineate the regions used for the computations.
  • Figure 5: Variation in the cardinality of the repêchage sets (vertical axis) as a function of the number of contaminating points (horizontal axis) in the test set $\mathcal{Y}$. Here, the maximum neighbourhood rank is set to $K_M = 500$ and the default $p_\text{ext}=10^{-5}$ is used. Panel A corresponds to datasets of 10,000 test and reference points, whereas Panel B shows datasets of 100,000 points, where the additional 90,000 points are drawn from the same underlying Gaussian distribution.
  • ...and 4 more figures