Light Curve Classification with DistClassiPy: a new distance-based classifier

Siddharth Chaini; Ashish Mahabal; Ajit Kembhavi; Federica B. Bianco

TL;DR

This work addresses the challenge of scalable, interpretable light-curve classification in time-domain astronomy by introducing DistClassiPy, a distance-metric classifier built on 18 metrics and domain-driven light-curve features. The method reduces 114 features to a compact 31-feature set and uses per-class medians with distance-based scoring, achieving $F_1$ scores comparable to a Random Forest baseline while offering faster computation and enhanced interpretability. Key contributions include a transparent feature-selection pipeline, confidence measures for distance-based decisions, and a publicly available open-source package suitable for large surveys like the Rubin Observatory LSST. The results demonstrate robust performance across multi-class, One-vs-Rest, and binary tasks, with strong scalability and potential for tailoring to specific science goals and datasets beyond astronomy.
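
The idea of per-class medians with distance-based scoring can be sketched in a few lines. The following is a minimal illustration, not the DistClassiPy API: the function names `fit_class_medians` and `predict_nearest_median` are hypothetical, the data are synthetic, and the sketch omits any feature scaling or confidence estimation the package may perform. It assumes only NumPy and SciPy's `cdist`.

```python
import numpy as np
from scipy.spatial.distance import cdist


def fit_class_medians(X, y):
    """Return the sorted class labels and one median feature vector per class."""
    classes = np.unique(y)
    medians = np.vstack([np.median(X[y == c], axis=0) for c in classes])
    return classes, medians


def predict_nearest_median(X, classes, medians, metric="cityblock"):
    """Assign each row of X to the class whose median is closest under `metric`."""
    distances = cdist(X, medians, metric=metric)  # shape: (n_samples, n_classes)
    return classes[np.argmin(distances, axis=1)], distances


# Toy data (an assumption for illustration): two well-separated synthetic
# classes in a 3-feature space.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 0.5, (50, 3)), rng.normal(5.0, 0.5, (50, 3))])
y_train = np.array(["A"] * 50 + ["B"] * 50)

classes, medians = fit_class_medians(X_train, y_train)
X_new = np.array([[0.1, -0.2, 0.3], [4.8, 5.1, 5.2]])
labels, dists = predict_nearest_median(X_new, classes, medians)
print(labels)
```

Swapping the `metric` argument (for example `"euclidean"` or `"canberra"`, both accepted by `cdist`) changes the decision rule without retraining, which is one way a distance-based classifier could be tailored to a given class of objects.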

Abstract

The rise of synoptic sky surveys has ushered in an era of big data in time-domain astronomy, making data science and machine learning essential tools for studying celestial objects. While tree-based models (e.g. Random Forests) and deep learning models dominate the field, we explore the use of different distance metrics to aid in the classification of astrophysical objects. We developed DistClassiPy, a new distance-metric-based classifier. The direct use of distance metrics is unexplored in time-domain astronomy, but distance-based methods can help make classification more interpretable and decrease computational costs. In particular, we applied DistClassiPy to classify light curves of variable stars, comparing the distances between objects of different classes. Using 18 distance metrics on a catalog of 6,000 variable stars across 10 classes, we demonstrate classification and dimensionality reduction. Our classifier matches state-of-the-art performance but has lower computational requirements and improved interpretability. Additionally, DistClassiPy can be tailored to specific objects by identifying the most effective distance metric for that classification. To facilitate broader applications within and beyond astronomy, we have made DistClassiPy open-source and available at https://pypi.org/project/distclassipy/.

Paper Structure (30 sections, 1 theorem, 3 equations, 19 figures, 5 tables, 2 algorithms)

Introduction
Distances in Machine Learning
Data
Catalog and Raw Light Curves
Feature Extraction
Data Cleaning
Feature Selection and Dimensionality Reduction
Drop all g-band features
Dropping flags and number of points
Dropping Highly Correlated Features
Classification
Classification Problems
DistClassiPy Classification Algorithm
Training
Predicting
...and 15 more sections

Key Result

Corollary 1

Figures (19)

Figure 1: A visualization of 16 (of the 18) distance metrics used throughout this work. Each subplot shows the equidistant loci measuring the distance from the central point $(5,5)$. The color background denotes the distance values, with labeled contours. Contours differ for each subplot, as the range of values the distance can take varies by metric. To aid readability, we use a log scale for the last two metrics, Kulczynski and Additive ChiSq, because high-power elements in the metric definitions compress the distance scale. The Correlation and Maryland Bridge metrics are not visualized here as they require vector inputs rather than 2-dimensional data points (see the appendix on the metrics).
Figure 2: An example light curve of an RR Lyrae Type ab (RA = 2.93, Dec = 44.62, period = 0.55 days), taken from ZTF DR15. Each light curve consists of a series of magnitude (brightness) values as a function of time, which in our case is available for two filters, $g$ and $r$. The sampling of ZTF is sparse, as is common for ground-based surveys, with visible gaps due to seasonality. Although this is a periodic variable, the periodicity is not obvious because of the sampling.
Figure 3: The distribution of pairwise Cityblock distances ($d_{CB}$) between all 558 Cepheid variable stars in our final dataset (class CEP): each of the 155,403 unique pairs of CEP contributes one point to this distribution. The left and right panels show the distribution of distances before and after outlier removal, respectively. The top panels show a box-and-whiskers plot of the same data, where the median is marked by a vertical line, the interquartile range by the box, and the 10th and 90th percentiles by the whiskers. All points beyond these percentile values (our definition of outlier, see the Data Cleaning section) are plotted individually. In the left plots (before outlier removal) the original distribution spreads out to $d_{CB}>10^9$ but is extremely sparse for $d_{CB}\geq10^5$ (see the inset, which zooms into the $d_{CB}<10^6$ region, and notice how in the top plot the box and whiskers are indistinguishable). Outlier removal leads to a much more compact distribution of pairwise distances (right), where the distances are contained within $0 < d_{CB} < 1{,}000$. Note that although the distribution changes after the first cut, the functional definition of outliers does not; the tail of the distribution is therefore still plotted with individual points in the right-side figure.
Figure 4: Correlation between the first two components of a harmonic series fit to each light curve in the $r$-band. The bottom left panel shows the linear relationship between the two components, while the remaining two panels illustrate the distribution of the component per class. We find that these two components have a Pearson's linear correlation coefficient $r=0.95$, and thus choose only Harmonics_mag_1 as part of our feature selection step.
Figure 5: Correlation matrices of our feature space before and after dimensionality reduction for four exemplary labels: CEP, DSCT, RR, RRc (see the table of classes for a definition of each class), which we use for the multi-class classification problem. Each cell in the plot represents the Pearson's linear correlation coefficient, $r$, between the corresponding features on the $x$ and $y$ axes. By definition, the diagonal has a correlation $r=1$. No correlation ($r=0$) is mapped to the color white, positive correlation ($r>0$) to shades of blue, and negative correlation ($r<0$) to shades of red. Our original set of 114 features per object (a) is computed using lc_classifier. These consist of features calculated from the $r$, $g$, and $g-r$ (multiband) light curves. In (a), the correlation between features is evident for each class in the block structure of the correlation matrix. We dropped $g$-based features, features that are not physically motivated, and finally, features that have a very high correlation. This leads to the correlation matrices shown in (b). Our final feature space has 31 features.
...and 14 more figures
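
Two operations described in the captions above, the pairwise Cityblock distances of Figure 3 and the removal of highly correlated features of Figures 4 and 5, can be illustrated with a short sketch. This is not the authors' pipeline: the data are synthetic, the helper names and the feature names `feat_a`, `feat_b`, `feat_c` are hypothetical, and the correlation threshold of 0.9 is an assumption chosen for illustration (the captions only state that very highly correlated features were dropped).

```python
import numpy as np
from scipy.spatial.distance import pdist


def pairwise_cityblock(X):
    """Condensed vector of Cityblock distances, one entry per unique pair of rows."""
    return pdist(X, metric="cityblock")


def drop_highly_correlated(X, names, threshold=0.9):
    """Greedily keep a feature only if |Pearson r| with every kept feature is <= threshold."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return [names[j] for j in kept]


rng = np.random.default_rng(42)

# 558 synthetic objects with 31 features, mirroring only the counts quoted
# in the captions; the values themselves are random.
X = rng.normal(size=(558, 31))
d_cb = pairwise_cityblock(X)
n_pairs = d_cb.size  # n * (n - 1) / 2 unique pairs
p10, p90 = np.percentile(d_cb, [10, 90])  # the whisker positions used in Figure 3
print(n_pairs, p10 < p90)

# Two nearly collinear synthetic columns plus one independent column.
a = rng.normal(size=500)
F = np.column_stack([a, 2.0 * a + rng.normal(scale=0.05, size=500), rng.normal(size=500)])
kept_names = drop_highly_correlated(F, ["feat_a", "feat_b", "feat_c"])
print(kept_names)
```

For $n = 558$ objects the number of unique pairs is $n(n-1)/2 = 155{,}403$, which matches the count quoted in the Figure 3 caption. The greedy rule keeps the first feature of each correlated group, analogous to retaining only Harmonics_mag_1 in Figure 4.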

Theorems & Definitions (8)

Definition 2.1
Definition A.1
Corollary 1
Example A.1
Definition A.1
Example A.2
Definition A.2
Example A.3
