Learning to Detect Interesting Anomalies
Alireza Vafaei Sadr, Bruce A. Bassett, Emmanuel Sekyi
TL;DR
AHUNT tackles the problem of detecting interesting anomalies that have never been seen before by learning a dynamic feature space through active learning. In each round, an Oracle labels a small set of strategically chosen examples and a CNN is retrained, so the latent representation evolves and the anomaly taxonomy grows alongside a reserve class, enabling personalized rankings of anomaly classes. On MNIST, CIFAR-10, and Galaxy-DESI, it outperforms both traditional anomaly detectors and active learning on static feature spaces, and it adapts to changing user interests. The approach offers a scalable, user-guided path to discovering meaningful anomalies in large, diverse datasets such as astronomical surveys.
Abstract
Anomaly detection algorithms are typically applied to static, unchanging data features hand-crafted by the user. But how does a user systematically craft good features for anomalies that have never been seen? Here we couple deep learning with active learning, in which an Oracle iteratively labels small amounts of data selected algorithmically over a series of rounds, to automatically and dynamically improve the data features for efficient outlier detection. This approach, AHUNT, shows excellent performance on MNIST, CIFAR-10, and Galaxy-DESI data, significantly outperforming both standard anomaly detection and active learning algorithms with static feature spaces. Beyond improved performance, AHUNT also allows the number of anomaly classes to grow organically in response to the Oracle's evaluations. Extensive ablation studies explore the impact of the Oracle question selection strategy and the loss function on performance. We illustrate how the dynamic anomaly class taxonomy represents another step towards fully personalized rankings of different anomaly classes that reflect a user's interests, allowing the algorithm to learn to ignore statistically significant but uninteresting outliers (e.g., noise). This should prove useful in the era of massive astronomical datasets serving diverse sets of users who can only review a tiny subset of the incoming data.
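
To make the round-based loop concrete, here is a minimal toy sketch, not the AHUNT implementation. It rests on several assumptions that are not in the abstract: a nearest-centroid model stands in for the CNN, the query rule simply picks the pool items farthest from every class centroid learned so far, and the names `fit_centroids`, `novelty_score`, `run_rounds`, and the synthetic 2-D data are all hypothetical. It only illustrates the general pattern of querying an Oracle, refitting, and letting new anomaly classes appear as the Oracle names them.

```python
import numpy as np

def fit_centroids(X, y):
    """Stand-in for retraining: one centroid per class labelled so far."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def novelty_score(X, centroids):
    """Distance to the nearest known-class centroid (larger = more unusual)."""
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return dists.min(axis=1)

def run_rounds(X_lab, y_lab, X_pool, oracle, n_rounds=3, n_queries=5):
    """Toy active-learning loop: score pool, ask the Oracle, refit, repeat."""
    pool_idx = np.arange(len(X_pool))
    for _ in range(n_rounds):
        _, centroids = fit_centroids(X_lab, y_lab)
        scores = novelty_score(X_pool[pool_idx], centroids)
        # Assumed query rule: ask about the highest-scoring pool items.
        ask = pool_idx[np.argsort(scores)[::-1][:n_queries]]
        answers = np.array([oracle(i) for i in ask])
        # New class labels from the Oracle simply extend the taxonomy.
        X_lab = np.vstack([X_lab, X_pool[ask]])
        y_lab = np.concatenate([y_lab, answers])
        pool_idx = np.setdiff1d(pool_idx, ask)
    return X_lab, y_lab, pool_idx

# Synthetic pool: one "normal" blob and two small anomaly blobs (hypothetical data).
rng = np.random.default_rng(0)
X_pool = np.vstack([
    rng.normal(0.0, 1.0, size=(200, 2)),
    rng.normal([8.0, 8.0], 0.5, size=(10, 2)),
    rng.normal([-8.0, 8.0], 0.5, size=(10, 2)),
])
true_labels = np.concatenate([np.zeros(200, int), np.ones(10, int), np.full(10, 2)])

# Seed set contains only normal examples, so no anomaly class is known at the start.
X_seed = rng.normal(0.0, 1.0, size=(20, 2))
y_seed = np.zeros(20, dtype=int)

X_lab, y_lab, remaining = run_rounds(
    X_seed, y_seed, X_pool, oracle=lambda i: int(true_labels[i])
)
print("classes known after querying:", np.unique(y_lab))
print("unlabelled pool items left:", len(remaining))
```

In this sketch the feature space itself never changes, which is exactly the static setting the abstract argues against; in AHUNT the retraining step would also update the learned representation.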
