Sparse, self-organizing ensembles of local kernels detect rare statistical anomalies

Gaia Grosso, Sai Sumedh R. Hindupur, Thomas Fel, Samuel Bright-Thonney, Philip Harris, Demba Ba

TL;DR

This work tackles the challenge of detecting rare, in-distribution anomalies in high-dimensional representations by proposing SparKer, a sparse, self-organizing ensemble of local Gaussian kernels trained under a semi-supervised Neyman–Pearson objective to approximate the log-density ratio between inspected data and an anomaly-free reference. The method enforces three principled inductive biases—sparsity, locality, and competition—via kernel-level SoftMax interactions and scale annealing, yielding interpretable, region-specific anomaly localization even in spaces with thousands of dimensions. Theoretical analysis casts SparKer as an energy-based learning system with particle-like kernel dynamics, proving that scale-annealing and local interactions drive convergence to anomalous regions and enabling automatic discovery of multiple, distinct anomalies. Empirically, SparKer outperforms kernel-based NP and MMD baselines across diverse domains, including scientific discovery, open-world novelty, intrusion detection, and generative AI validation, often with only a handful of kernels. The framework offers scalable, interpretable anomaly detection with practical impact for scientific inference, ML safety, and model validation, and points to future work on adaptive kernel covariances and broader applications beyond anomaly detection.

Abstract

Modern artificial intelligence has revolutionized our ability to extract rich and versatile data representations across scientific disciplines. Yet, the statistical properties of these representations remain poorly controlled, causing misspecified anomaly detection (AD) methods to falter. Weak or rare signals can remain hidden within the apparent regularity of normal data, creating a gap in our ability to detect and interpret anomalies. We examine this gap and identify a set of structural desiderata for detection methods operating under minimal prior information: sparsity, to enforce parsimony; locality, to preserve geometric sensitivity; and competition, to promote efficient allocation of model capacity. These principles define a class of self-organizing local kernels that adaptively partition the representation space around regions of statistical imbalance. As an instantiation of these principles, we introduce SparKer, a sparse ensemble of Gaussian kernels trained within a semi-supervised Neyman–Pearson framework to locally model the likelihood ratio between a sample that may contain anomalies and a nominal, anomaly-free reference. We provide theoretical insights into the mechanisms that drive detection and self-organization in the proposed model, and demonstrate the effectiveness of this approach on realistic high-dimensional problems of scientific discovery, open-world novelty detection, intrusion detection, and generative-model validation. Our applications span both the natural- and computer-science domains. We demonstrate that ensembles containing only a handful of kernels can identify statistically significant anomalous locations within representation spaces of thousands of dimensions, underscoring the interpretability, efficiency, and scalability of the proposed approach.
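
To make the idea of a sparse ensemble of local, competing Gaussian kernels concrete, the following is a minimal sketch and not the authors' implementation. The exact functional form of SparKer is not given in this summary, so everything here is an assumption made for illustration: the gating form (a softmax across kernels multiplied by a Gaussian activation), the shared width `sigma`, and the names `kernel_components` and `score` are all hypothetical.

```python
import numpy as np

def kernel_components(x, centers, amplitudes, sigma):
    """Per-kernel contributions f_i(x) of a softmax-gated Gaussian ensemble.

    Illustrative form only (an assumption, not the paper's definition):
        f_i(x) = a_i * softmax_i(-|x - mu_i|^2 / (2 sigma^2)) * exp(-|x - mu_i|^2 / (2 sigma^2))
    """
    # Squared distance between every point and every kernel center,
    # shape (n_points, n_kernels).
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2 / (2.0 * sigma ** 2)
    k = np.exp(logits)  # locality: activation decays away from each center
    # Competition: softmax across kernels, computed in a numerically stable way.
    z = logits - logits.max(axis=1, keepdims=True)
    gate = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return amplitudes[None, :] * gate * k

def score(x, centers, amplitudes, sigma):
    """Total score as the sum of the per-kernel components."""
    return kernel_components(x, centers, amplitudes, sigma).sum(axis=1)

# Two kernels in 2D; three query points: one at each center and one far away.
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
amplitudes = np.array([1.0, 0.5])
points = np.array([[0.0, 0.0], [3.0, 3.0], [10.0, 10.0]])

components = kernel_components(points, centers, amplitudes, sigma=0.5)
total = score(points, centers, amplitudes, sigma=0.5)
```

In this toy setup each point near a center is dominated by a single component, and a point far from every center receives a score close to zero, which is the sense in which a score built this way can be read off kernel by kernel.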

Paper Structure

This paper contains 57 sections, 9 theorems, 38 equations, 16 figures, 7 tables.

Key Result

Lemma 1

The dynamics of the $i^{th}$ kernel location ${\bm{\mu}}_i$ is the result of radial forces arising from the training points. Let $y$ be the class label associated with each data point ($y=0$ for ${\bm{x}}\in {\cal{R}}$ and $y=1$ for ${\bm{x}} \in {\cal{D}}$), and let $m_i({\bm{x}})=2a_i k_i({\bm{x}}

Figures (16)

  • Figure 1: Detecting rare in-distribution anomalies with SparKer. (a) Cartoon diagram illustrating the regimes of anomaly detection; problems are characterized by anomaly separability ($x$-axis) and anomaly fraction ($y$-axis). Detection becomes harder near the origin, where anomalies are rare and overlap with normal data. The right panel qualitatively pictures the different corners of the space. The left panel highlights the region SparKer is designed for, i.e. low anomaly fraction and low separability. The stars indicate the suite of applications from natural and computer science that we showcase in Sections \ref{sec:applications} and \ref{sec:interp}. (b) SparKer's anomaly detection pipeline: source data are embedded to extract domain-specific features; SparKer compares the distribution of the inspected data to anomaly-free data in feature space by means of a self-organizing ensemble of local kernels, and extracts an anomaly score; (c) Sparsity, locality and competition, which underlie the SparKer model, make it possible to decompose the anomaly score into distinct, interpretable components, $\{f_i\}$, pointing at different anomalous regions in the feature space, thus enabling a geometry-aware analysis of the anomalies.
  • Figure 2: Local layers enable geometric interpretation of activation patterns. Examples of 2D log-density ratio approximated by a 1-layer model with ReLU activation (top left grid) and mixture of Gaussian kernels (bottom left grid) subject to different sparsity constraints. The local activation characterizing Gaussian kernels enables efficient selection of convex regions, whereas linear activation requires the interplay of multiple parameters, undermining interpretability. We report on the right side the ground truth of the log-density ratio function targeted by the model, as well as the density of the sample with injected anomalies.
  • Figure 3: Interpreting SparKer as a system of interacting particles. The training data (i.e. the visible units) in orange represent the system environment while the kernels (i.e. hidden units) are latent variables of the system, capturing higher order correlations in the data patterns. In the absence of the ${\rm SoftMax}$ activation (left graphic), the hidden units interact with the environment but are not aware of each other. ${\rm SoftMax}$ activation introduces an interaction term between kernels that promotes joint optimization and unit specialization (right graphic).
  • Figure 4: Attraction and repulsion govern kernels' learning dynamics. Example of kernels' trajectories over training time in a 2-dimensional setting where normal data are uniformly distributed in a square (gray points). Three sources of anomalies, highlighted in yellow, are injected in the data. We run a 4-kernel SparKer model and examine the kernels' self-organization dynamics represented by the colored lines. Two stages can be identified in the dynamics: a first stage in which the kernels are broad and pushed to explore the space, and a second stage in which progressively narrow kernels are radially attracted towards the anomalous regions (see zoomed out panels).
  • Figure 5: Scale annealing overcomes the vanishing gradient leading to convergence. Kernel distance from anomalous point as a function of training time using linear annealing (blue line) or a fixed value of the kernel width (other colors). Kernels that are too wide or too narrow either do not localize the signal or do not converge in a reasonable amount of time. With $\sigma=1$, the model localizes the signal. With annealing, the model localizes the signal with progressively higher resolution.
  • ...and 11 more figures
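
The vanishing-gradient effect described in the Figure 5 caption can be illustrated with a single Gaussian kernel. This is a hedged sketch, not the paper's training code. For $k = \exp(-d^2 / (2\sigma^2))$, the gradient with respect to the kernel center has magnitude $(d/\sigma^2)\exp(-d^2/(2\sigma^2))$, which is small both when $\sigma \ll d$ and when $\sigma \gg d$, and peaks at $\sigma = d/\sqrt{2}$. The function names, the distance of 5, and the schedule endpoints below are hypothetical choices for illustration.

```python
import numpy as np

def center_gradient_norm(distance, sigma):
    """Magnitude of d k / d mu for k = exp(-distance^2 / (2 sigma^2))."""
    return distance / sigma ** 2 * np.exp(-distance ** 2 / (2.0 * sigma ** 2))

def linear_annealing(step, n_steps, sigma_start, sigma_end):
    """Hypothetical linear width schedule, clipped to [0, n_steps]."""
    frac = min(max(step / n_steps, 0.0), 1.0)
    return sigma_start + frac * (sigma_end - sigma_start)

# A single anomalous point at distance 5 from the kernel center.
distance = 5.0
grads = {sigma: center_gradient_norm(distance, sigma) for sigma in (0.1, 5.0, 50.0)}

# Width at a few training steps under the assumed linear schedule.
schedule = [linear_annealing(t, 100, sigma_start=5.0, sigma_end=0.1) for t in (0, 50, 100)]
```

With these numbers the pull on the center is strongest for the intermediate width and negligible for the narrowest one. This matches the qualitative picture in the caption: a broad kernel early in training feels distant points, and shrinking the width later sharpens the localization.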

Theorems & Definitions (16)

  • Lemma 1: Push-pull dynamic of kernels' locations
  • Definition 1: Kernel radial $\alpha$-level
  • Lemma 2: Kernel's Sphere of Influence
  • Lemma 3: Scale annealing leads to localization
  • Theorem 1: Local convergence
  • Lemma 4: Radial influence of points on kernels' location
  • proof
  • Lemma 5: Push-pull dynamic of kernels
  • proof
  • Definition 2: Kernel radial $\alpha$-level
  • ...and 6 more