Finding Pegasus: Enhancing Unsupervised Anomaly Detection in High-Dimensional Data using a Manifold-Based Approach
R. P. Nathan, Nikolaos Nikolaou, Ofer Lahav
TL;DR
This work reframes unsupervised anomaly detection (AD) in high-dimensional data through the lens of the low-dimensional manifold produced by dimensionality reduction (DR). It distinguishes on-manifold from off-manifold anomalies and introduces the Finding Pegasus approach, which combines complementary on- and off-manifold AD methods to boost recall when substantial DR is applied, without sacrificing precision. The authors formalize a framework linking raw data, dimensionality reduction, and anomaly detectability, and validate the concepts on MNIST with both PCA and autoencoder manifolds, showing substantial improvements in detection of diverse anomaly types (including novel, Pegasus-type cases). The approach holds promise for real-world high-dimensional settings (e.g., astronomy, imaging, fraud detection) by enabling more comprehensive and efficient anomaly discovery through manifold-aware method integration.
Abstract
Unsupervised machine learning methods are well suited to searching for anomalies at scale but can struggle with the high-dimensional representation of many modern datasets; hence dimensionality reduction (DR) is often performed first. In this paper we analyse unsupervised anomaly detection (AD) from the perspective of the manifold created in DR. We present an idealised illustration, "Finding Pegasus", and a novel formal framework with which we categorise AD methods and their results into "on manifold" and "off manifold". We define these terms and show how they differ. We then use this insight to develop an approach to combining AD methods which significantly boosts AD recall without sacrificing precision in situations employing high DR. When tested on MNIST data, our approach of combining AD methods improves recall by as much as 16 percent compared with simply combining with the best standalone AD method (Isolation Forest), a result which shows great promise for its application to real-world data.
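To make the on-manifold versus off-manifold distinction concrete, the following is a minimal sketch on synthetic data, not the authors' implementation. It assumes a linear (PCA) manifold fitted with NumPy, uses a variance-scaled distance within the manifold as a stand-in on-manifold detector, and uses reconstruction error as a stand-in off-manifold detector. The helper names (`fit_pca`, `on_manifold_scores`, `off_manifold_scores`, `top_k`), the toy data, and the rule of taking the union of each detector's top candidates are all illustrative assumptions.

```python
import numpy as np


def fit_pca(X, n_components):
    """Fit PCA via SVD; return the mean, principal axes, and per-axis variance."""
    mean = X.mean(axis=0)
    _, singular_values, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    variances = singular_values[:n_components] ** 2 / (len(X) - 1)
    return mean, components, variances


def on_manifold_scores(X, mean, components, variances):
    """Variance-scaled distance from the data centre, measured within the manifold."""
    z = (X - mean) @ components.T
    return np.sqrt((z ** 2 / variances).sum(axis=1))


def off_manifold_scores(X, mean, components):
    """Reconstruction error: distance from each point to its projection on the manifold."""
    centred = X - mean
    reconstruction = (centred @ components.T) @ components
    return np.linalg.norm(centred - reconstruction, axis=1)


def top_k(scores, k):
    """Indices of the k highest-scoring points (k must be at least 1)."""
    return set(np.argsort(scores)[-k:].tolist())


# Toy data (assumption): 500 points near a 2-D plane embedded in 10-D space.
rng = np.random.default_rng(0)
X = 0.1 * rng.normal(size=(500, 10))
X[:, :2] += rng.normal(size=(500, 2))

# One anomaly far out along the plane, one displaced away from the plane.
on_anomaly = np.zeros(10)
on_anomaly[0] = 15.0
off_anomaly = np.zeros(10)
off_anomaly[2] = 5.0
X = np.vstack([X, on_anomaly, off_anomaly])  # anomalies at indices 500 and 501

mean, components, variances = fit_pca(X, n_components=2)
flagged_on = top_k(on_manifold_scores(X, mean, components, variances), 1)
flagged_off = top_k(off_manifold_scores(X, mean, components), 1)

# Combine the two complementary detectors by taking the union of their candidates.
flagged = flagged_on | flagged_off
print(sorted(flagged))
```

In this toy setting each detector finds only one of the two planted anomalies, so either one used alone misses half of them. The union recovers both, which is the intuition behind combining on- and off-manifold methods when the dimensionality reduction is aggressive.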
