A Classifier-Based Approach to Multi-Class Anomaly Detection Applied to Astronomical Time-Series

Rithwik Gupta, Daniel Muthukrishna, Michelle Lochner

TL;DR

The paper tackles automated anomaly detection in time-domain astronomy by using the latent space of a classifier. It introduces Multi-Class Isolation Forests (MCIF), which trains a separate isolation forest for each known class and takes the minimum of the class-specific scores as an object's anomaly score. MCIF is applied to a 100-dimensional latent representation from a GRU-based light-curve classifier. On simulated ZTF-like data, MCIF recovers about 85% of the anomalies within the top 2,000 ranked objects and reaches AUROC values competitive with state-of-the-art approaches; the analysis also shows how latent-space clustering affects performance. The results indicate that classifiers can be repurposed for real-time anomaly detection at the scale of the data deluge expected from LSST. The code is publicly available for reuse.

Abstract

Automating anomaly detection is an open problem in many scientific fields, particularly in time-domain astronomy, where modern telescopes generate millions of alerts per night. Currently, most anomaly detection algorithms for astronomical time-series rely either on hand-crafted features or on features generated through unsupervised representation learning, coupled with standard anomaly detection algorithms. In this work, we introduce a novel approach that leverages the latent space of a neural network classifier for anomaly detection. We then propose a new method called Multi-Class Isolation Forests (MCIF), which trains separate isolation forests for each class to derive an anomaly score for an object based on its latent space representation. This approach significantly outperforms a standard isolation forest when distinct clusters exist in the latent space. Using a simulated dataset emulating the Zwicky Transient Facility (54 anomalies and 12,040 common), our anomaly detection pipeline discovered $46\pm3$ anomalies ($\sim 85\%$ recall) after following up the top 2,000 ($\sim 15\%$) ranked objects. Furthermore, our classifier-based approach outperforms or approaches the performance of other state-of-the-art anomaly detection pipelines. Our novel method demonstrates that existing and new classifiers can be effectively repurposed for real-time anomaly detection. The code used in this work, including a Python package, is publicly available, https://github.com/Rithwik-G/AstroMCAD.
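The MCIF idea described above (one isolation forest per known class, with the minimum class-specific score used as the anomaly score) can be sketched as follows. This is a minimal illustration, not the AstroMCAD implementation: the class name `MultiClassIsolationForest`, its methods, the use of scikit-learn's `IsolationForest`, and the synthetic 8-dimensional "latent" vectors are all assumptions made for the example. In the paper, the inputs would be latent representations produced by the trained light-curve classifier.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


class MultiClassIsolationForest:
    """Hypothetical sketch: one isolation forest per class, minimum score wins."""

    def __init__(self, n_estimators=100, random_state=0):
        self.n_estimators = n_estimators
        self.random_state = random_state
        self.forests_ = {}

    def fit(self, latents, labels):
        latents = np.asarray(latents)
        labels = np.asarray(labels)
        # Train a separate forest on the latent vectors of each known class.
        for cls in np.unique(labels):
            forest = IsolationForest(
                n_estimators=self.n_estimators,
                random_state=self.random_state,
            )
            forest.fit(latents[labels == cls])
            self.forests_[cls] = forest
        return self

    def anomaly_score(self, latents):
        latents = np.asarray(latents)
        # score_samples is higher for normal points, so negate it to get
        # "higher means more anomalous" for every class-specific forest.
        per_class = np.stack(
            [-forest.score_samples(latents) for forest in self.forests_.values()],
            axis=1,
        )
        # An object is only anomalous if it looks anomalous to every class,
        # so take the minimum over the class-specific scores.
        return per_class.min(axis=1)


# Synthetic stand-in for classifier latents (assumption: three well-separated
# clusters in 8 dimensions; real latents come from the trained classifier).
rng = np.random.default_rng(0)
dim = 8
centers = {"class_a": -10.0, "class_b": 0.0, "class_c": 10.0}
train_x = np.vstack([rng.normal(c, 1.0, size=(300, dim)) for c in centers.values()])
train_y = np.repeat(list(centers), 300)

mcif = MultiClassIsolationForest().fit(train_x, train_y)

test_common = np.vstack([rng.normal(c, 1.0, size=(50, dim)) for c in centers.values()])
test_far = rng.normal(30.0, 1.0, size=(5, dim))  # far from every known cluster

scores_common = mcif.anomaly_score(test_common)
scores_far = mcif.anomaly_score(test_far)
print("median common score:", np.median(scores_common))
print("minimum far-point score:", scores_far.min())
```

Taking the minimum means an object that sits comfortably inside any one class cluster receives a low score, which is why the approach is expected to help most when the latent space contains distinct per-class clusters.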

Paper Structure (18 sections, 3 equations, 10 figures, 2 tables)

Figures (10)

  • Figure 1: A visual summary of the architecture described in this work. Our approach first trains a classifier, then repurposes it as an encoder, and finally applies Multi-Class Isolation Forests (MCIF), proposed in this work, for anomaly detection.
  • Figure 2: The UMAP reduction of the latent space derived from the test set, which includes 10% of the common transients reserved for testing the classifier [left] and randomly sampled anomalous transients from the unseen anomaly dataset [right]. Although the classifier was not trained on this data, the learned features still exhibit clear visual structure, and anomalous transients form distinct clusters separate from the common classes. Note that the UMAP reduction is used only for visualization; the actual anomaly detection is performed on the 100-dimensional latent space.
  • Figure 3: The distribution of anomaly scores for each class, computed using MCIF [left] or a single isolation forest [right] on the latent representations derived from full light curves. The scores are plotted using $100\%$ of the anomalous dataset (unseen during training) and the test dataset of common classes. With MCIF, the anomalous classes (bottom five in red) generally show higher anomaly scores with positively skewed distributions; this is less true with a single isolation forest. The common classes and CaRTs all have low anomaly scores when using MCIF.
  • Figure 4: Anomalies detected in the 2,000 top-ranked transients by MCIF anomaly score index, using a test sample reflecting the estimated frequency of anomalies in nature. In the sample of 12,040 common transients and 54 anomalous transients, the model recalls $46\pm3$ ($\sim 85\%$) of the anomalies after following up the top 2,000 ranked transients. The left plot aggregates all anomalies and the right plot breaks them down by class. To control for the variance imposed by the small anomaly sample size, we repeat the sampling 50 times. The mean and standard deviation of detected anomalies are plotted as the solid lines and shaded regions, respectively.
  • Figure 5: The UMAP reduction of the training data in the latent space for a classifier trained for detecting the class SNII [left] and DSCT [right] as anomalous, using the data introduced in Perez-Carrasco et al. (2023) and used in Section \ref{sec:benchmarking}. As the UMAP only plots the training data, it includes all the classes in the respective hierarchical category (seen in Table \ref{table:results}) except the one set aside as anomalous.
  • ...and 5 more figures
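
The evaluation described in the Figure 4 caption (rank all objects by anomaly score, follow up the top 2,000, and repeat the anomaly sampling 50 times to report a mean and standard deviation) can be sketched as below. This is an illustrative sketch only: the function names `recall_at_k` and `repeated_recall`, the pool of 500 anomalies, and the Gaussian toy scores are assumptions, and the numbers it prints are not results from the paper.

```python
import random
import statistics


def recall_at_k(scores, is_anomaly, k):
    """Fraction of all true anomalies found among the k highest-scoring objects."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    found = sum(is_anomaly[i] for i in ranked[:k])
    return found / sum(is_anomaly)


def repeated_recall(common_scores, anomaly_pool, n_anomalies, k, n_repeats, seed=0):
    """Resample a small anomaly set several times; return mean and std of recall."""
    rng = random.Random(seed)
    flags = [0] * len(common_scores) + [1] * n_anomalies
    recalls = []
    for _ in range(n_repeats):
        sampled = rng.sample(anomaly_pool, n_anomalies)
        recalls.append(recall_at_k(common_scores + sampled, flags, k))
    return statistics.mean(recalls), statistics.stdev(recalls)


# Toy anomaly scores (assumption: synthetic Gaussians, not model outputs).
toy_rng = random.Random(42)
common_scores = [toy_rng.gauss(0.45, 0.05) for _ in range(12040)]
anomaly_pool = [toy_rng.gauss(0.60, 0.08) for _ in range(500)]

# Sample sizes follow the caption: 54 anomalies, top 2,000 followed up, 50 repeats.
mean_recall, std_recall = repeated_recall(
    common_scores, anomaly_pool, n_anomalies=54, k=2000, n_repeats=50
)
print(f"recall in top 2000: {mean_recall:.2f} +/- {std_recall:.2f}")
```

Resampling the 54 anomalies keeps the realistic anomaly rate in each trial while averaging out the noise that comes from any single small draw.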