AstroPT: Scaling Large Observation Models for Astronomy

Michael J. Smith, Ryan J. Roberts, Eirini Angeloudi, Marc Huertas-Company

TL;DR

AstroPT tackles the token-scarce yet information-rich regime of astronomical data by training an autoregressive transformer on 8.6 million DESI-LS DR8 galaxy postage stamps. The work demonstrates a saturating neural scaling law, with downstream task performance improving as pretraining compute increases up to a saturation point, and emergent capabilities appearing at moderate model scales. Embeddings exhibit structure in UMAP projections and improve in downstream linear probes, indicating physically informative representations. By releasing code, weights, and the underlying dataset under an MIT license, the authors advocate open collaboration to accelerate the development of Large Observation Models in astronomy.

Abstract

This work presents AstroPT, an autoregressive pretrained transformer developed with astronomical use-cases in mind. The AstroPT models presented here have been pretrained on 8.6 million $512 \times 512$ pixel $grz$-band galaxy postage stamp observations from the DESI Legacy Survey DR8. We train a selection of foundation models of increasing size from 1 million to 2.1 billion parameters, and find that AstroPT follows a similar saturating log-log scaling law to textual models. We also find that the models' performance on downstream tasks, as measured by linear probing, improves with model size up to the model parameter saturation point. We believe that collaborative community development paves the best route towards realising an open source `Large Observation Model' -- a model trained on data taken from the observational sciences at the scale seen in natural language processing. To this end, we release the source code, weights, and dataset for AstroPT under the MIT license, and invite potential collaborators to join us in collectively building and researching these models.

Paper Structure (8 sections, 4 figures, 1 table)

This paper contains 8 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: In this work we train AstroPT on the surrogate task of predicting the next token in a 'spiralised' sequence of galaxy image patches. The above image shows the token feed order. As the galaxies are in the centre of each postage stamp, this setup allows us to seamlessly pretrain and run inference on differently sized galaxy postage stamps.
  • Figure 2: Validation set losses over our full training runs. The left plot shows the validation loss per training floating point operation (FLOP), and the right plot shows the validation loss per $16 \times 16$ image patch token seen. Each run is labelled with the total neural parameter count as crossmatched in Tab. \ref{tab_hparam}.
  • Figure 3: Results from our AstroPT-1M embedding UMAP projections (upper), and AstroPT-89M embedding UMAP projections (lower). We colour the hex bins in both plots with a selected set of emergent physical properties of the galaxies. We find significant structure, signifying that the model has learned physically meaningful representations of the dataset. In the above plots '$M_g$' and '$M_z$' are the absolute magnitudes in the $g$ and $z$ bands, 'mean sSFR' is the mean specific star formation rate, and '$M_*$' is the stellar mass. 'smooth?', 'disc?', 'artefact?', 'edge on?' and 'tight spiral?' are Galaxy Zoo survey responses for these morphological features. Our metadata sources are described further in §\ref{sec_dataset}.
  • Figure 4: Here we show our relative linear probe performances per pretraining FLOP spent on a selection of scientifically-meaningful downstream tasks. The markers are coloured according to the models' parameter counts. We compute Spearman's $\rho$ and find in all cases a strong positive correlation between downstream task performance and model size, meaning that larger models have more informative embeddings. In this plot '$M_g$' and '$M_z$' are the absolute magnitudes in the $g$ and $z$ bands, 'mean sSFR' is the mean specific star formation rate, and '$M_*$' is the stellar mass. 'smooth?', 'disc?', 'artefact?', 'edge on?' and 'tight spiral?' are Galaxy Zoo survey responses for these morphological features. Our metadata sources are described further in §\ref{sec_dataset}.
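
The 'spiralised' token order described in the Figure 1 caption can be illustrated with a short sketch. This is not the authors' implementation: the function name `spiral_order`, the clockwise direction, and the choice to start by moving right from the centre patch are all assumptions made for illustration. The only facts taken from the text are that patches are fed in a spiral and that the galaxy sits in the centre of each stamp.

```python
def spiral_order(n):
    """Return (row, col) indices of an n x n patch grid, spiralling
    outward from the centre patch.

    Hypothetical helper: direction (clockwise) and first move (right)
    are illustrative assumptions, not taken from the paper.
    """
    r = c = (n - 1) // 2          # centre patch (upper-left centre if n is even)
    order = [(r, c)]
    dr, dc = 0, 1                 # first move: one patch to the right
    step = 1
    while len(order) < n * n:
        for _ in range(2):        # each step length is used for two legs
            for _ in range(step):
                r, c = r + dr, c + dc
                if 0 <= r < n and 0 <= c < n:   # skip positions off the grid
                    order.append((r, c))
            dr, dc = dc, -dr      # turn clockwise: right, down, left, up
        step += 1
    return order


# A 512 x 512 stamp cut into 16 x 16 patches gives a 32 x 32 grid.
tokens = spiral_order(512 // 16)
```

Because the sequence starts at the centre, the order for a small odd-sized grid is a prefix of the order for a larger one once both are expressed relative to the centre patch. That is one way to read the caption's remark that the same model can handle differently sized postage stamps.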