ATCAT: Astronomical Timeseries CAusal Transformer

Zora Tung

TL;DR

ATCAT is a lightweight, transformer-based time-series classifier tailored to LSST-like light curves (LCs). It sets a new state of the art on ELAsTiCC for both LC-only and LC+metadata inputs. Its contributions span light-curve encoding, metadata integration, and local-attention transformer layers, and it supports unsupervised pretraining, early detection, and calibrated outputs. The method remains strong with limited labels and runs about 9x faster than other state-of-the-art ELAsTiCC models, which makes it suitable for large-scale surveys, with practical implications for follow-up prioritization and anomaly detection. The work also offers guidelines for data standardization and calibration, discusses future generative-modeling opportunities, and lays groundwork for cross-survey applicability and scalable time-domain classification.

Abstract

The Legacy Survey of Space and Time (LSST) at the Vera C. Rubin Observatory will capture light curves (LCs) for 10 billion sources and produce millions of transient candidates per night, necessitating scalable, accurate, and efficient classification. To prepare the community for this scale of data, the Extended LSST Astronomical Time-Series Classification Challenge (ELAsTiCC) sought to simulate a diversity of LSST-like time-domain events. Using a small transformer-based model and refined light-curve encoding logic, we present new state-of-the-art classification performance on ELAsTiCC, with 71.8% F1 on LC-only classifications and 89.8% F1 on LC+metadata classifications. The previous state of the art was 65.5% F1 for LC-only; for LC+metadata, it was 84% F1 with a different setup and 83.5% F1 with a directly comparable setup. Our model outperforms previous state-of-the-art models for fine-grained early detection at all time cutoffs, which should help prioritize candidate transients for follow-up observations. We demonstrate label-efficient training by removing labels from 90% of the training data (chosen uniformly at random) and compensating with regularization, bootstrap ensembling, and unsupervised pretraining. Even with only 10% of the labeled data, we achieve 67.4% F1 on LC-only and 87.1% F1 on LC+metadata, validating an approach that should help mitigate synthetic and observational data drift and improve classification on tasks with less labeled data. Using reliability diagrams, we find that our base model is poorly calibrated, and we correct this at a minimal cost to overall performance, enabling selections by classification precision. Finally, our GPU-optimized implementation is 9x faster than other state-of-the-art ELAsTiCC models and can run inference at ~33,000 LCs/s on a consumer-grade GPU, making it less expensive to train and suitable for large-scale applications.

Paper Structure

This paper contains 47 sections, 17 equations, 16 figures, 8 tables.

Figures (16)

  • Figure 1: High-level schematic of dataflow for pretraining and fine-tuning (training of the classification model). In implementation, all grey nodes except preprocessing are run at runtime, with instrumentation to double-check their correctness. See text.
  • Figure 2: Our model architecture. Our model is a transformer with 4 layers. The first two layers use local attention, the first with a 1-day threshold, and the second with a 10-day threshold. Metadata is encoded as the first token in the sequence, allowing LC points in the global attention layers to attend to it.
  • Figure 3: Local attention connections. We visualize the connections in our local attention mechanism for a specific ELAsTiCC example (this one is a Type-Ia supernova). The first transformer layer, featuring a local attention mechanism with a threshold of 1 day, allows the two points in channels 3 (i) and 4 (z) near $t=27$ to attend to each other, and likewise the cluster of 3 points at the end (near $t=37$), but only causally (bottom left figure). For the first pair, this is represented by the point at query index 3 being allowed to attend to key/value index 2 (presence of a black dot). For the second layer, with a threshold of 10 days (bottom right figure), the first two points can attend to each other, and all of the other points as well, but also only causally.
  • Figure 4: Nonlinear scaling of flux values for generative modeling. We squash the flux values to a much smaller range by “gluing” together a tanh function (around 0) and a log function, matching the value and first derivative at the join point. We also pre-scale flux by 1/10. This keeps the response curve from being too flat for the majority of values, while scaling the max value from 2,568,897 to 6. The top panel is the response curve; the bottom panel is a histogram (aligned in X-axis values) of all training flux values from all light curves (not showing the long tail of extreme values). As elsewhere, we are looking at calibrated flux values after mean field subtraction, which can be negative.
  • Figure 5: Mixture-of-Gaussian components for generative modeling. We graph the components of our mixture-of-Gaussians model, showing actual frequencies of scaled flux values and the sum of the Gaussian components (black line, each component equally weighted). For our actual models, we used 64- and 128-component models; here we show only 32 for ease of visualization. The black line matches our histogram fairly well, as desired, with some deviation around 2.4 for this 32-component model. Incorrect choices of sigmas (component widths) will result in the black line being jagged, or in components being too broad. Individual components are the many colored lines.
  • ...and 11 more figures
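
The local attention pattern described in the Figure 3 caption can be sketched as a boolean mask over observation times. This is a minimal illustration, not the paper's implementation: the function name `local_causal_mask`, the example times, and the choice to define causality by sequence index are all assumptions, and the metadata token from Figure 2 is omitted.

```python
import numpy as np

def local_causal_mask(times, threshold_days):
    """Return mask[q, k] == True when query q may attend to key k.

    Assumption: points are sorted by time, causality is defined by
    sequence index, and "local" means |t_q - t_k| <= threshold_days.
    """
    t = np.asarray(times, dtype=float)
    idx = np.arange(len(t))
    causal = idx[None, :] <= idx[:, None]                      # key not after query
    local = np.abs(t[:, None] - t[None, :]) <= threshold_days  # close in time
    return causal & local

# Hypothetical observation times in days (not taken from the paper).
times = [10.0, 15.0, 27.0, 27.02, 36.5, 36.52, 36.54]
layer1 = local_causal_mask(times, threshold_days=1.0)   # 1-day threshold
layer2 = local_causal_mask(times, threshold_days=10.0)  # 10-day threshold
```

With these made-up times, `layer1[3, 2]` is set while `layer1[2, 3]` is not, mirroring the caption's description of query index 3 attending to key/value index 2 but not the reverse.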
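
The tanh/log "gluing" in the Figure 4 caption can be written down as a short worked example. The caption gives the 1/10 pre-scale and the requirement that value and first derivative match at the join; it does not give the join point, so `X0 = 1.0` below is an assumption, as are the names `squash_flux`, `A`, and `B`. Matching tanh(x) to A*log(x) + B at X0 gives A = X0 * (1 - tanh(X0)^2) and B = tanh(X0) - A * log(X0).

```python
import numpy as np

X0 = 1.0                            # assumed join point (not stated in the caption)
A = X0 * (1.0 - np.tanh(X0) ** 2)   # matches the first derivative at X0
B = np.tanh(X0) - A * np.log(X0)    # matches the value at X0

def squash_flux(flux, prescale=0.1):
    """Odd-symmetric squashing: tanh near zero, scaled log in the tails."""
    x = np.asarray(flux, dtype=float) * prescale
    ax = np.abs(x)
    # Clip before the log so the unused branch never evaluates log(0).
    log_branch = A * np.log(np.maximum(ax, X0)) + B
    return np.where(ax <= X0, np.tanh(x), np.sign(x) * log_branch)

max_scaled = float(squash_flux(2_568_897.0))  # largest flux quoted in the caption
```

Under this assumed join point, the quoted maximum flux lands close to 6, which is consistent with the caption, though the authors' exact constants may differ.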
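
The equally weighted mixture in the Figure 5 caption can also be illustrated with a small sketch. The caption does not say how component centers or sigmas are chosen, so everything here is an assumption: centers are placed at evenly spaced quantiles of the data (so that equal weights can track the histogram), sigmas are set from the spacing between neighboring centers, and the helper names and synthetic samples are hypothetical.

```python
import numpy as np

def equal_weight_mixture(samples, n_components=32, width_factor=1.0):
    """Place centers at evenly spaced quantiles; set sigmas from their spacing.

    Assumption: this placement rule is illustrative, not the paper's.
    """
    q = (np.arange(n_components) + 0.5) / n_components
    centers = np.quantile(samples, q)
    sigmas = width_factor * np.gradient(centers)  # local center spacing
    return centers, sigmas

def mixture_pdf(x, centers, sigmas):
    """Density of the mixture with every component weighted 1/n_components."""
    x = np.asarray(x, dtype=float)[:, None]
    z = (x - centers[None, :]) / sigmas[None, :]
    comp = np.exp(-0.5 * z ** 2) / (sigmas[None, :] * np.sqrt(2.0 * np.pi))
    return comp.mean(axis=1)

# Synthetic stand-in for scaled flux values (not the ELAsTiCC data).
rng = np.random.default_rng(0)
samples = rng.normal(size=10_000)
centers, sigmas = equal_weight_mixture(samples, n_components=32)
xs = np.linspace(-10.0, 10.0, 20_001)
density = mixture_pdf(xs, centers, sigmas)
```

Lowering `width_factor` makes the summed curve jagged and raising it makes components overly broad, which is the trade-off the caption describes for the choice of sigmas.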