StarEmbed: Benchmarking Time Series Foundation Models on Astronomical Observations of Variable Stars

Weijian Li, Hong-Yu Chen, Qinjie Lin, Nabeel Rehemtulla, Ved G. Shah, Dennis Wu, Adam A. Miller, Han Liu

TL;DR

Time-domain astronomy faces a data deluge of irregular, multivariate light curves that challenge traditional pipelines. StarEmbed provides the first public benchmark to evaluate time-series foundation models on ZTF light curves across seven classes, focusing on unsupervised clustering, supervised classification, and out-of-distribution detection under zero-shot transfer. Chronos-based TSFMs show strong generalization to astronomical data and achieve state-of-the-art performance on OOD detection, while hand-crafted features remain highly competitive for clustering and classification; domain-specific Astromer models give limited zero-shot gains. The study advocates a paradigm shift toward generic foundation representations for petascale time-series analysis in upcoming surveys like LSST and publishes embeddings, datasets, and code to enable community-driven progress.

Abstract

Time series foundation models (TSFMs) are increasingly being adopted as highly capable general-purpose time series representation learners. Although their training corpora are vast, they exclude astronomical time series data. Observations of stars produce peta-scale time series with unique challenges including irregular sampling and heteroskedasticity. We introduce StarEmbed, the first public benchmark for rigorous and standardized evaluation of state-of-the-art TSFMs on stellar time series observations ("light curves"). We benchmark on three scientifically-motivated downstream tasks: unsupervised clustering, supervised classification, and out-of-distribution source detection. StarEmbed integrates a catalog of expert-vetted labels with multivariate light curves from the Zwicky Transient Facility, yielding ~40k hand-labeled light curves spread across seven astrophysical classes. We evaluate the zero-shot representation capabilities of three TSFMs (MOIRAI, Chronos, Chronos-Bolt) and a domain-specific transformer (Astromer) against handcrafted feature extraction, the long-standing baseline in the astrophysics literature. Our results demonstrate that these TSFMs, especially the Chronos models, which are trained on data completely unlike the astronomical observations, can outperform established astrophysics-specific baselines in some tasks and effectively generalize to entirely new data. In particular, TSFMs deliver state-of-the-art performance on our out-of-distribution source detection benchmark. With the first benchmark of TSFMs on astronomical time series data, we test the limits of their generalization and motivate a paradigm shift in time-domain astronomy from using task-specific, fully supervised pipelines toward adopting generic foundation model representations for the analysis of peta-scale datasets from forthcoming observatories.

Paper Structure

This paper contains 34 sections, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Example ZTF light curves illustrating unique characteristics of astronomical time series, including multiple passbands, large observational gaps, and heteroskedastic uncertainties. Top panel: Observed light curve of a periodic variable exhibiting typical characteristics of the observations. The inset shows the full $\sim$6.5 yr duration of ZTF observations. Lower panels: Phase-folded light curves highlighting the differing periodic patterns in three different classes. Note that most stars have few $i$ passband observations so we exclude these data from our analysis (see text for further details).
  • Figure 2: Left: F1 ranking across all baselines with different classifier heads. The Chronos-tiny model consistently outperforms other TSFMs and the domain-specific Astromer models, but the hand-crafted features provide the best overall performance. Right: Confusion matrix of Chronos-tiny + MLP, one of the best performing TSFM-classifier combinations, and the confusion matrix of hand-crafted features with the RF classifier, the SOTA baseline in astrophysics. Chronos-tiny yields better performance on most classes (EA, RRd, RS CVn, and LPV), indicating that the TSFM is effectively extracting appropriate information for classification.
  • Figure 3: UMAP projections for each embedding model included in our analysis using the test set. Inset plots at the bottom of each figure show clustering of different classes.
  • Figure 4: Comparison of embedding alignment for Astromer-1 (left) and Chronos-tiny (right) models. The plots show the distribution of cosine similarities between light curve embeddings, both within the MACHO survey (green) and across surveys (ZTF vs. MACHO, orange). Astromer-1 exhibits embedding collapse, with cosine similarities approaching $1.0$ for both within-survey and cross-survey pairs. This indicates that the model encodes little discriminative structure. In contrast, Chronos-tiny produces more meaningful embeddings. The wider distribution of cosine similarities preserves structural information, and the clear separation between within-survey and cross-survey pairs demonstrates its ability to capture the domain shift between datasets.
  • Figure 5: Confusion matrix of Chronos-tiny on four classifiers.
  • ...and 5 more figures
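
The embedding-collapse diagnostic described in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' code: the embeddings are synthetic stand-ins rather than Astromer or Chronos outputs, and the helper names (`cosine_similarity`, `pairwise_similarities`, `summarize`), the dimension, and the noise scale are all assumptions chosen for illustration. The idea is only that a collapsed embedding space gives pairwise cosine similarities concentrated near 1.0, while a more informative one gives a wider distribution.

```python
import math
import random
from itertools import combinations


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(embeddings):
    """Cosine similarity for every unordered pair of embeddings."""
    return [cosine_similarity(u, v) for u, v in combinations(embeddings, 2)]


def summarize(sims):
    """Mean and (population) standard deviation of a list of similarities."""
    mean = sum(sims) / len(sims)
    var = sum((s - mean) ** 2 for s in sims) / len(sims)
    return mean, math.sqrt(var)


rng = random.Random(0)
dim = 16  # hypothetical embedding dimension, chosen only for the example

# Synthetic "collapsed" embeddings: one shared vector plus tiny noise,
# so every light curve maps to almost the same direction.
base = [rng.gauss(0, 1) for _ in range(dim)]
collapsed = [[b + rng.gauss(0, 0.01) for b in base] for _ in range(50)]

# Synthetic "spread" embeddings: independent random directions.
spread = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(50)]

collapsed_mean, collapsed_std = summarize(pairwise_similarities(collapsed))
spread_mean, spread_std = summarize(pairwise_similarities(spread))

print(f"collapsed: mean={collapsed_mean:.4f}, std={collapsed_std:.4f}")
print(f"spread:    mean={spread_mean:.4f}, std={spread_std:.4f}")
```

With real embeddings, the same summary could be computed separately for within-survey and cross-survey pairs; a clear gap between the two distributions would correspond to the domain-shift separation the caption attributes to Chronos-tiny.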