Attention-Based Preprocessing Framework for Improving Rare Transient Classification
Xinyue Sheng, Tuan Dung Pham, Zichi Zhang, Matt Nicholl, Thai Son Mai
TL;DR
The paper addresses the difficulty of identifying rare astronomical transients under extreme class imbalance and image artefacts by introducing a data-centric augmentation pipeline built entirely on real observations. For images, it combines image restoration via a Similarity Index, an attention-guided masking strategy that keeps the focus on the transient and its host, and arbitrary-rotation augmentation. For light curves, it fits a two-dimensional Gaussian Process and cross-matches the resulting samples with galaxy images of the same class to preserve survey realism. A tailored focal loss and a robust calibration framework are used to boost the purity of SLSNe-I and TDE classifications while maintaining practical detection rates, demonstrated with NEEDLE on real ZTF data. The approach substantially improves high-confidence classification of rare classes, enabling more efficient spectroscopic follow-up in current and upcoming surveys (e.g., LSST) without resorting to physically modeled simulations. It offers a scalable, data-driven route to characterizing and following up rare transients in large time-domain datasets.
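For readers unfamiliar with focal loss, the sketch below shows the standard multi-class form, FL = -alpha_t (1 - p_t)^gamma log(p_t), for a single example. It is not the tailored variant used with NEEDLE, whose details are not given in this summary. The function name `focal_loss`, the default `gamma=2.0`, and the example class weights are illustrative assumptions.

```python
import math


def focal_loss(probs, label, gamma=2.0, alpha=None):
    """Standard focal loss for one example (illustrative, not the paper's variant).

    probs : sequence of predicted class probabilities (assumed to sum to 1)
    label : integer index of the true class
    gamma : focusing parameter; gamma = 0 reduces to (weighted) cross-entropy
    alpha : optional sequence of per-class weights, e.g. larger for rare classes
    """
    # Clip the true-class probability to avoid log(0).
    p_t = min(max(probs[label], 1e-12), 1.0)
    weight = 1.0 if alpha is None else alpha[label]
    # The (1 - p_t)**gamma factor down-weights examples that are already
    # classified confidently, so hard examples dominate the loss.
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # Hypothetical three-class problem where class 2 is the rare one.
    class_weights = [1.0, 1.0, 5.0]
    easy = focal_loss([0.05, 0.05, 0.90], label=2, alpha=class_weights)
    hard = focal_loss([0.60, 0.30, 0.10], label=2, alpha=class_weights)
    print(f"easy rare example: {easy:.4f}")
    print(f"hard rare example: {hard:.4f}")
```

With `gamma` greater than zero, a confidently correct example contributes far less than a misclassified one, which is why this family of losses is often paired with imbalanced training sets.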
Abstract
With large numbers of transients discovered by current and future imaging surveys, machine learning is increasingly applied to light curve and host galaxy properties to select events for follow-up. However, finding rare types of transients remains difficult due to extreme class imbalances in training sets, and extracting features from host images is complicated by the presence of bright foreground sources, particularly if the true host is faint or distant. Here we present a data augmentation pipeline for images and light curves that mitigates these issues, and apply this to improve classification of Superluminous Supernovae Type I (SLSNe-I) and Tidal Disruption Events (TDEs) with our existing NEEDLE code. The method uses a Similarity Index to remove image artefacts, and a masking procedure that removes unrelated sources while preserving the transient and its host. This focuses classifier attention on the relevant pixels, and enables arbitrary rotations for class upsampling. We also fit observed multi-band light curves with a two-dimensional Gaussian Process and generate data-driven synthetic samples by resampling and redshifting these models, cross-matching with galaxy images in the same class to produce unique but realistic new examples for training. Models trained with the augmented dataset achieve substantially higher purity: for classifications with a confidence of 0.8 or higher, we achieve 75% (43%) purity and 75% (66%) completeness for SLSNe-I (TDEs).
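The purity and completeness figures quoted for a confidence threshold can be made concrete with a small sketch. It assumes that an object is selected for a class when that class has the highest predicted probability and the probability is at least the threshold, and that completeness is measured against all true members of the class; the abstract does not state the exact convention, so both are assumptions. The function name `purity_completeness` and the toy data are hypothetical.

```python
def purity_completeness(y_true, probs, target, threshold=0.8):
    """Purity and completeness of high-confidence predictions for one class.

    y_true    : list of true class indices
    probs     : list of per-object probability lists (one entry per class)
    target    : class index of interest (e.g. the SLSN-I or TDE class)
    threshold : minimum predicted probability for a selection to count
    """
    n_selected = 0   # objects predicted as `target` with confidence >= threshold
    true_pos = 0     # selected objects that really belong to `target`
    n_true = sum(1 for y in y_true if y == target)

    for y, p in zip(y_true, probs):
        predicted = max(range(len(p)), key=lambda k: p[k])
        if predicted == target and p[predicted] >= threshold:
            n_selected += 1
            if y == target:
                true_pos += 1

    # Return NaN when a ratio is undefined rather than raising.
    purity = true_pos / n_selected if n_selected else float("nan")
    completeness = true_pos / n_true if n_true else float("nan")
    return purity, completeness


if __name__ == "__main__":
    # Toy data: three classes, six objects (values are made up).
    y_true = [0, 0, 1, 1, 1, 2]
    probs = [
        [0.90, 0.05, 0.05],
        [0.60, 0.30, 0.10],
        [0.85, 0.10, 0.05],
        [0.10, 0.80, 0.10],
        [0.20, 0.70, 0.10],
        [0.10, 0.10, 0.80],
    ]
    print(purity_completeness(y_true, probs, target=0))
```

Raising the threshold typically trades completeness for purity, which is the relevant trade-off when spectroscopic follow-up time is limited.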
