Sparse, self-organizing ensembles of local kernels detect rare statistical anomalies
Gaia Grosso, Sai Sumedh R. Hindupur, Thomas Fel, Samuel Bright-Thonney, Philip Harris, Demba Ba
TL;DR
This work tackles the challenge of detecting rare, in-distribution anomalies in high-dimensional representations by proposing SparKer, a sparse, self-organizing ensemble of local Gaussian kernels trained under a semi-supervised Neyman–Pearson objective to approximate the log-density ratio between inspected data and an anomaly-free reference. The method enforces three principled inductive biases (sparsity, locality, and competition) via kernel-level SoftMax interactions and scale annealing, yielding interpretable, region-specific anomaly localization even in spaces with thousands of dimensions. Theoretical analysis casts SparKer as an energy-based learning system with particle-like kernel dynamics, proving that scale annealing and local interactions drive convergence to anomalous regions and enable automatic discovery of multiple, distinct anomalies. Empirically, SparKer outperforms kernel-based NP and MMD baselines across diverse domains, including scientific discovery, open-world novelty detection, intrusion detection, and generative AI validation, often with only a handful of kernels. The framework offers scalable, interpretable anomaly detection with practical impact for scientific inference, ML safety, and model validation, and points to future work on adaptive kernel covariances and broader applications beyond anomaly detection.
Abstract
Modern artificial intelligence has revolutionized our ability to extract rich and versatile data representations across scientific disciplines. Yet, the statistical properties of these representations remain poorly controlled, causing misspecified anomaly detection (AD) methods to falter. Weak or rare signals can remain hidden within the apparent regularity of normal data, creating a gap in our ability to detect and interpret anomalies. We examine this gap and identify a set of structural desiderata for detection methods operating under minimal prior information: sparsity, to enforce parsimony; locality, to preserve geometric sensitivity; and competition, to promote efficient allocation of model capacity. These principles define a class of self-organizing local kernels that adaptively partition the representation space around regions of statistical imbalance. As an instantiation of these principles, we introduce SparKer, a sparse ensemble of Gaussian kernels trained within a semi-supervised Neyman–Pearson framework to locally model the likelihood ratio between a sample that may contain anomalies and a nominal, anomaly-free reference. We provide theoretical insights into the mechanisms that drive detection and self-organization in the proposed model, and demonstrate the effectiveness of this approach on realistic high-dimensional problems of scientific discovery, open-world novelty detection, intrusion detection, and generative-model validation. Our applications span both the natural- and computer-science domains. We show that ensembles containing only a handful of kernels can identify statistically significant anomalous locations within representation spaces of thousands of dimensions, underscoring the interpretability, efficiency, and scalability of the proposed approach.
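
To make the idea of a small kernel ensemble modelling a likelihood ratio concrete, the following is a minimal NumPy sketch and not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the way competition is expressed (a softmax across kernels multiplying each Gaussian response), the fixed shared width `sigma`, and the toy two-dimensional data. The loss is one whose pointwise minimizer is the log-density ratio between the inspected data and the weighted reference; the section does not state the exact objective used for SparKer.

```python
import numpy as np

def log_kernels(x, centers, sigma):
    """Log of isotropic Gaussian kernel responses, shape (n_points, n_kernels)."""
    d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return -d2 / (2.0 * sigma ** 2)

def log_ratio(x, centers, amplitudes, sigma):
    """Hypothetical ensemble output f(x), meant to approximate a log-density ratio.

    Assumed form: each kernel's Gaussian response is scaled by a softmax
    weight computed across kernels, so nearby kernels compete for a point
    while the output still vanishes far from every center.
    """
    logk = log_kernels(x, centers, sigma)
    z = logk - logk.max(axis=1, keepdims=True)   # stabilized softmax
    comp = np.exp(z)
    comp /= comp.sum(axis=1, keepdims=True)
    return (amplitudes * comp * np.exp(logk)).sum(axis=1)

def np_loss(f_ref, f_data, w_ref):
    """Loss minimized pointwise by f = log(n_data / (w_ref * n_ref))."""
    return w_ref * np.sum(np.expm1(f_ref)) - np.sum(f_data)

# Toy setup (all numbers are illustrative assumptions).
rng = np.random.default_rng(0)
reference = rng.normal(size=(2000, 2))                 # anomaly-free sample
background = rng.normal(size=(1000, 2))
anomaly = rng.normal(loc=3.0, scale=0.3, size=(50, 2)) # rare local excess
data = np.vstack([background, anomaly])
w_ref = len(background) / len(reference)               # expected-yield weight

centers = np.array([[3.0, 3.0], [0.0, 0.0]])
sigma = 0.5

# All amplitudes zero: f = 0 everywhere, so the loss is exactly zero.
zero_amp = np.zeros(2)
loss_null = np_loss(log_ratio(reference, centers, zero_amp, sigma),
                    log_ratio(data, centers, zero_amp, sigma), w_ref)

# One active kernel placed on the excess lowers the loss below zero.
amp = np.array([1.0, 0.0])
loss_fit = np_loss(log_ratio(reference, centers, amp, sigma),
                   log_ratio(data, centers, amp, sigma), w_ref)

print(loss_null, loss_fit)
```

In this sketch the centers and amplitudes are set by hand; in a trained model they would be optimized, and a quantity such as minus twice the minimized loss could serve as a test statistic. The drop in loss from a single well-placed kernel illustrates why a handful of local kernels can suffice when the anomaly is confined to a small region.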
