Weakly-Supervised Anomaly Detection in the Milky Way

Mariel Pettee; Sowmya Thanvantri; Benjamin Nachman; David Shih; Matthew R. Buckley; Jack H. Collins

TL;DR

The paper demonstrates that Classification Without Labels (CWoLa), a weakly-supervised, model-agnostic anomaly detection method, can locate stellar streams in Gaia DR2 data. Within each patch of sky, a neural classifier is trained to distinguish a signal-enriched window in proper motion from the signal-depleted sidebands around it; the method identifies both simulated streams and the known GD-1 stream with modest computation. It achieves 56% purity and 51% completeness for GD-1 across 21 patches and uncovers density features such as gaps, spurs, blobs, and cocoon-like structures, while offering a route to augment existing stream catalogs. The approach is computationally efficient, scalable, and broadly applicable to anomaly detection in astrophysics, beyond its origins in high-energy physics.
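The purity and completeness figures quoted above can be read with the usual definitions in mind: purity is the fraction of selected stars that are labeled stream members, and completeness is the fraction of labeled stream members that were selected. The sketch below assumes those standard definitions; the function name and the toy star IDs are hypothetical and are not taken from the paper.

```python
def purity_and_completeness(selected_ids, true_stream_ids):
    """Return (purity, completeness) for a set of selected stars.

    purity       = fraction of selected stars that are labeled stream members
    completeness = fraction of labeled stream members that were selected
    """
    selected = set(selected_ids)
    truth = set(true_stream_ids)
    n_correct = len(selected & truth)
    purity = n_correct / len(selected) if selected else 0.0
    completeness = n_correct / len(truth) if truth else 0.0
    return purity, completeness


# Hypothetical toy example: 4 stars selected, 3 of them are labeled members,
# out of 6 labeled stream members in total.
selected = [1, 2, 3, 10]
truth = [1, 2, 3, 4, 5, 6]
purity, completeness = purity_and_completeness(selected, truth)
print(purity, completeness)  # 0.75 0.5
```

A selection can trade one quantity against the other: keeping fewer, higher-scoring stars tends to raise purity and lower completeness.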

Abstract

Large-scale astrophysics datasets present an opportunity for new machine learning techniques to identify regions of interest that might otherwise be overlooked by traditional searches. To this end, we use Classification Without Labels (CWoLa), a weakly-supervised anomaly detection method, to identify cold stellar streams within the more than one billion Milky Way stars observed by the Gaia satellite. CWoLa operates without the use of labeled streams or knowledge of astrophysical principles. Instead, we train a classifier to distinguish between mixed samples for which the proportions of signal and background samples are unknown. This computationally lightweight strategy is able to detect both simulated streams and the known stream GD-1 in data. Originally designed for high-energy collider physics, this technique may have broad applicability within astrophysics as well as other domains interested in identifying localized anomalies.
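The core idea of training on mixed samples can be illustrated with a deliberately small toy. The sketch below is not the paper's neural network or data: it assumes a single hypothetical feature, Gaussian toy distributions, invented signal fractions (20% and 2%), and a plain logistic regression fitted by gradient descent. The classifier only ever sees which mixed sample a point came from; the true signal flags are used solely to check the result afterwards.

```python
import math
import random

random.seed(0)


def make_mixed_sample(n, signal_fraction):
    """Draw n (feature, is_signal) pairs; is_signal is kept only for evaluation."""
    sample = []
    for _ in range(n):
        is_signal = random.random() < signal_fraction
        # Toy assumption: signal clusters near x = 3, background near x = 0.
        x = random.gauss(3.0, 0.5) if is_signal else random.gauss(0.0, 1.0)
        sample.append((x, is_signal))
    return sample


# Two mixed samples with different (in practice unknown) signal fractions.
enriched = make_mixed_sample(2000, 0.20)
depleted = make_mixed_sample(2000, 0.02)

# Training labels are sample membership only: 1 = enriched, 0 = depleted.
data = [(x, 1.0) for x, _ in enriched] + [(x, 0.0) for x, _ in depleted]

# Logistic regression trained with full-batch gradient descent.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(300):
    grad_w = grad_b = 0.0
    for x, y in data:
        p = 1.0 / (1.0 + math.exp(-(w * x + b)))
        grad_w += (p - y) * x
        grad_b += p - y
    w -= lr * grad_w / len(data)
    b -= lr * grad_b / len(data)


def score(x):
    """Classifier output: estimated probability of belonging to the enriched sample."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))


# Evaluate with the hidden truth flags, which were never used in training.
all_points = enriched + depleted
sig_scores = [score(x) for x, is_sig in all_points if is_sig]
bkg_scores = [score(x) for x, is_sig in all_points if not is_sig]
mean_sig = sum(sig_scores) / len(sig_scores)
mean_bkg = sum(bkg_scores) / len(bkg_scores)
print(mean_sig > mean_bkg)
```

Because the only systematic difference between the two mixed samples is their signal content, a classifier that separates them ends up ranking signal-like points above background-like ones, even though no per-point labels were provided.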

Paper Structure

This paper contains 21 sections, 1 equation, 14 figures.

Introduction
Motivation
Related Work
CWoLa: Classification Without Labels
Outline
Gaia Dataset
Data Preprocessing
Methods
Classification Without Labels (CWoLa)
Defining Signal & Sideband Regions
Neural Network Architecture and Training Procedure
Model Evaluation
Results
GD-1 Stream Identification
Towards an Augmentation of the GD-1 Stream Labeling
...and 6 more sections

Figures (14)

Figure 1: Two-dimensional histograms of the six features used in this analysis are illustrated for a single patch of the sky containing some GD-1 stars. This patch is centered at ($l=207.0$, $b=50.2$). The top row shows the full patch with no selections applied. The second row shows the patch with fiducial selections applied: $g < 20.2$ to reduce streaking; $|\mu_{\lambda}| > 2$ mas/year or $|\mu_\phi^{*}| > 2$ mas/year to remove too-distant stars; and $0.5 \leq b-r \leq 1$ to focus on identifying cold stellar streams. The third row shows the six features for the GD-1 stream stars after the fiducial selections.
Figure 2: Signal-enriched and signal-depleted groups are pictured above. The green data points labeled "S" represent signal events, while the red data points labeled "B" represent background events. The signal and sideband regions are chosen such that more signal events (shown in orange) are located in the central signal region than the surrounding sideband region.
Figure 3: Stars associated with the stellar stream GD-1 are highly localized in $\mu_\lambda$ space in comparison with background stars for the same patch of Gaia data seen in Figure 1. The signal region, shown in the darkest regions in each plot, is defined by taking $\pm1\sigma$ from the median $\mu_\lambda$ value for the stream stars, which in this case is $[-13.6, -11.4]$. The sideband region is defined by taking $\pm3\sigma$ from the stream's median $\mu_\lambda$ value, excluding the signal region: $[-15.8, -13.6)$ and $(-11.4, -9.3]$.
Figure 4: Distributions for the five neural network inputs are compared for both GD-1 stars (in red) and background stars (in grey) across signal and sideband regions. The patch shown here is the same example patch from Figure 1. For both stream and background stars, the distributions for these five variables across the signal and sideband regions are approximately indistinguishable.
Figure 5: The full scope of stars identified by the CWoLa method in overlapping patches across the angular range corresponding to GD-1. Light gray dots indicate the ground truth labeling of GD-1 stars [pwb18], while the top 250 stars identified by CWoLa in each patch are indicated by colored dots. The colors are chosen to correlate with each star's $\alpha$ value.
...and 9 more figures
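The fiducial selections in the Figure 1 caption and the signal and sideband windows in the Figure 3 caption can be written down as simple cuts. The sketch below uses the thresholds quoted in those captions for the example patch; the dictionary keys (g, mu_lambda, mu_phi, b_r), the function names, and the four example stars are hypothetical and serve only to illustrate the logic.

```python
# Window edges in mu_lambda quoted in the Figure 3 caption for the example patch.
SIGNAL_REGION = (-13.6, -11.4)   # median +/- 1 sigma, closed interval
OUTER_WINDOW = (-15.8, -9.3)     # median +/- 3 sigma; sideband = outer minus signal


def passes_fiducial(star):
    """Apply the fiducial selections quoted in the Figure 1 caption."""
    bright_enough = star["g"] < 20.2
    moving_enough = abs(star["mu_lambda"]) > 2.0 or abs(star["mu_phi"]) > 2.0
    in_color_window = 0.5 <= star["b_r"] <= 1.0
    return bright_enough and moving_enough and in_color_window


def classify_region(mu_lambda):
    """Return 'signal', 'sideband', or 'outside' for a mu_lambda value."""
    if SIGNAL_REGION[0] <= mu_lambda <= SIGNAL_REGION[1]:
        return "signal"
    if OUTER_WINDOW[0] <= mu_lambda <= OUTER_WINDOW[1]:
        return "sideband"
    return "outside"


# Hypothetical stars, for illustration only.
stars = [
    {"g": 18.0, "mu_lambda": -12.5, "mu_phi": -3.0, "b_r": 0.7},
    {"g": 19.5, "mu_lambda": -14.2, "mu_phi": 0.5, "b_r": 0.9},
    {"g": 20.5, "mu_lambda": -12.0, "mu_phi": -3.0, "b_r": 0.7},  # too faint
    {"g": 17.0, "mu_lambda": -1.0, "mu_phi": 1.0, "b_r": 0.6},    # proper motion too small
]
labels = [classify_region(s["mu_lambda"]) for s in stars if passes_fiducial(s)]
print(labels)  # ['signal', 'sideband']
```

Checking the signal window first means its closed endpoints are assigned to the signal region, which matches the half-open sideband intervals given in the caption.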
