In Search of the Unknown Unknowns: A Multi-Metric Distance Ensemble for Out of Distribution Anomaly Detection in Astronomical Surveys

Siddharth Chaini, Federica B. Bianco, Ashish Mahabal

TL;DR

This paper addresses the problem of finding unknown astrophysical novelties in large time-domain surveys by splitting anomaly detection into two tasks: Out-of-Distribution (OOD) discovery and rare in-distribution detection. It introduces Distance Multi-Metric Anomaly Detection (DiMMAD), a semi-supervised approach that ensembles 16 distance metrics to compute a multi-metric anomaly score across multiple feature-space geometries. The core contributions are centroid-based training on known classes, a two-stage aggregation (per-metric, then cross-metric) to rank test objects, and an evaluation on simulated LSST-like ELAsTiCC data and real ZTF data, in which DiMMAD discovers OOD anomalies more efficiently and with greater class diversity than the compared methods. DiMMAD is implemented in the open-source DistClassiPy package. The work provides a scalable, interpretable tool for prioritizing follow-up in LSST-era surveys, and the framework can incorporate newly discovered classes by updating the centroids.
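The centroid-based training and two-stage aggregation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation and not the DistClassiPy API: the four metrics are a hypothetical stand-in for the full ensemble, the centroid is assumed to be the per-class feature-wise median, the per-metric stage is assumed to be the distance to the nearest centroid, and the cross-metric stage is assumed to be a mean of normalized ranks. All function names are illustrative.

```python
import numpy as np

# Hypothetical stand-in for the metric ensemble (assumption: the paper uses more).
def euclidean(x, c):
    return np.sqrt(np.sum((x - c) ** 2, axis=-1))

def cityblock(x, c):
    return np.sum(np.abs(x - c), axis=-1)

def chebyshev(x, c):
    return np.max(np.abs(x - c), axis=-1)

def canberra(x, c):
    num = np.abs(x - c)
    den = np.abs(x) + np.abs(c)
    # Terms of the form 0/0 are treated as 0.
    terms = np.divide(num, den, out=np.zeros_like(num), where=den > 0)
    return np.sum(terms, axis=-1)

METRICS = {"euclidean": euclidean, "cityblock": cityblock,
           "chebyshev": chebyshev, "canberra": canberra}

def fit_centroids(X_train, y_train):
    """One centroid per known class (assumption: feature-wise median)."""
    classes = np.unique(y_train)
    return np.stack([np.median(X_train[y_train == k], axis=0) for k in classes])

def multi_metric_scores(X_test, centroids, metrics=METRICS):
    """Anomaly score in [0, 1]; higher means farther from every known class."""
    per_metric = []
    for fn in metrics.values():
        # Stage 1 (per metric): distance to the nearest known-class centroid.
        d = np.stack([fn(X_test, c) for c in centroids], axis=1)
        nearest = d.min(axis=1)
        # Normalize to ranks so metrics with different scales are comparable.
        ranks = nearest.argsort().argsort() / max(len(nearest) - 1, 1)
        per_metric.append(ranks)
    # Stage 2 (cross metric): average the normalized ranks (assumed aggregator).
    return np.mean(per_metric, axis=0)

# Toy data: two known classes, plus a test set whose last point fits neither.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(2.0, 0.3, (50, 2)),
                     rng.normal(8.0, 0.3, (50, 2))])
y_train = np.repeat([0, 1], 50)
X_test = np.array([[2.1, 1.9], [7.9, 8.2], [2.0, 2.2], [8.1, 8.0], [2.0, 8.0]])

centroids = fit_centroids(X_train, y_train)
scores = multi_metric_scores(X_test, centroids)
print("follow-up order:", np.argsort(scores)[::-1])
```

In this toy setup the last test point lies far from both centroids under every metric, so it receives the highest score. Adding a newly confirmed class would amount to appending its centroid to `centroids`, which mirrors the update path the summary mentions.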

Abstract

Distance-based methods involve the computation of distance values between features and are a well-established paradigm in machine learning. In anomaly detection, anomalies are identified by their large distance from normal data points. However, the performance of these methods often hinges on a single, user-selected distance metric (e.g., Euclidean), which may not be optimal for the complex, high-dimensional feature spaces common in astronomy. Here, we introduce a novel anomaly detection method, Distance Multi-Metric Anomaly Detection (DiMMAD), which uses an ensemble of distance metrics to find novelties. Using multiple distance metrics is effectively equivalent to using different geometries in the feature space. By using a robust ensemble of diverse distance metrics, we overcome the metric-selection problem, creating an anomaly score that is not reliant on any single definition of distance. We demonstrate this multi-metric approach as a tool for simple, interpretable scientific discovery on astronomical time series -- (1) with simulated data for the upcoming Vera C. Rubin Observatory Legacy Survey of Space and Time, and (2) real data from the Zwicky Transient Facility. We find that DiMMAD excels at out-of-distribution anomaly detection -- anomalies in the data that might be new classes -- and beats other state-of-the-art methods in the goal of maximizing the diversity of new classes discovered. For rare in-distribution anomaly detection, DiMMAD performs similarly to other methods, but may allow for improved interpretability. All our code is open source: DiMMAD is implemented within DistClassiPy: https://github.com/sidchaini/distclassipy/, while all code to reproduce the results of this paper is available here: https://github.com/sidchaini/dimmad/.
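The statement that different metrics correspond to different feature-space geometries can be checked with a small worked example. The sketch below uses three textbook metrics chosen for illustration (not necessarily a subset of the paper's ensemble) and two hypothetical points that are equally far from a center at $(5,5)$ in the Euclidean sense.

```python
import numpy as np

center = np.array([5.0, 5.0])
a = np.array([8.0, 5.0])                        # displaced along one axis
b = center + (3.0 / np.sqrt(2.0)) * np.ones(2)  # displaced along the diagonal

def euclidean(x, c):
    return float(np.sqrt(np.sum((x - c) ** 2)))

def cityblock(x, c):
    return float(np.sum(np.abs(x - c)))

def chebyshev(x, c):
    return float(np.max(np.abs(x - c)))

for name, fn in [("euclidean", euclidean),
                 ("cityblock", cityblock),
                 ("chebyshev", chebyshev)]:
    print(f"{name:10s} a={fn(a, center):.3f}  b={fn(b, center):.3f}")
```

Both points are at Euclidean distance 3 from the center, yet the city-block metric places `b` farther away than `a` (about 4.243 versus 3), while the Chebyshev metric places it closer (about 2.121 versus 3). A detector tied to a single metric therefore inherits that metric's notion of "far", which is the metric-selection problem the ensemble is meant to avoid.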

Paper Structure

This paper contains 12 sections and 5 figures.

Figures (5)

  • Figure 1: A visualization of 15 (of total 16) distance metrics used in our anomaly detector, DiMMAD. In each panel, the intensity of the color denotes the distance value from a central point $(5,5)$ for that metric in a 2-dimensional feature space. Contours have been plotted for clarity. To aid readability, Kulczynski has been plotted in a log scale. Each distance metric has a distinct geometry, and helps in making DiMMAD more robust. The correlation metric is omitted from this plot, as it cannot be meaningfully represented within a two-dimensional space.
  • Figure 2: Anomaly Detection Performance of our method (DiMMAD), compared to other standard AD methods. The shaded region denotes $1\sigma$ errors from Monte Carlo cross-validation. Left: the mean purity of OOD anomalies within the top $N$ candidates on the ELAsTiCC data. The DiMMAD methods (solid lines) maintain significantly higher purity. Right: the cumulative number of new, unique OOD classes discovered on the ELAsTiCC data. DiMMAD discovers a greater, more diverse set of anomalies more efficiently.
  • Figure 3: As Figure 2 (left), but for real ZTF (ALeRCE) data.
  • Figure 4: Cumulative number of objects discovered vs. follow-up budget for individual OOD classes from the ELAsTiCC dataset. Each panel represents a separate experiment where the titled class was the sole anomaly type. DiMMAD (bold orange/purple) consistently performs among the best methods for most classes of OOD anomalies across a diverse range of physical phenomena.
  • Figure 5: Anomaly Detection Performance of our method (DiMMAD), compared to other standard AD methods on the ELAsTiCC data as in Figure 2, but for Rare In-Distribution Anomalies. The shaded region denotes the $1\sigma$ errors from Monte Carlo cross-validation.
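
The two quantities plotted in Figure 2, purity of OOD anomalies within the top $N$ candidates and the cumulative number of unique OOD classes discovered, can be computed from a ranked list of anomaly scores. The sketch below is one plausible reading of those definitions, not the paper's evaluation code; the function names, scores, and class labels are hypothetical.

```python
import numpy as np

def purity_at_n(scores, is_ood, n):
    """Fraction of true OOD objects among the n highest-scoring candidates."""
    top = np.argsort(scores)[::-1][:n]
    return float(np.mean(is_ood[top]))

def unique_classes_discovered(scores, labels, ood_classes):
    """Number of distinct OOD classes seen after each successive follow-up."""
    order = np.argsort(scores)[::-1]
    seen, curve = set(), []
    for lab in labels[order]:
        if str(lab) in ood_classes:
            seen.add(str(lab))
        curve.append(len(seen))
    return np.array(curve)

# Hypothetical toy ranking: "known" is in-distribution, "new_*" are OOD classes.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4])
labels = np.array(["new_1", "known", "new_2", "new_1", "known", "known"])
ood_classes = {"new_1", "new_2"}
is_ood = np.isin(labels, list(ood_classes))

purity = purity_at_n(scores, is_ood, n=3)
curve = unique_classes_discovered(scores, labels, ood_classes)
print(f"purity@3 = {purity:.3f}")
print("unique OOD classes vs. budget:", curve.tolist())
```

Here two of the top three candidates are OOD, giving a purity of 2/3, and both OOD classes have been found after three follow-ups. Repeating the calculation over resampled train/test splits would give the mean and spread that a shaded $1\sigma$ band summarizes.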