The Scaling Law in Stellar Light Curves

Jia-Shu Pan, Yuan-Sen Ting, Yang Huang, Jie Yu, Ji-Feng Liu

TL;DR

The paper investigates whether scaling laws from other domains apply to astronomical time series by training GPT-2–style autoregressive transformers in a self-supervised fashion on Kepler stellar light curves. It demonstrates that pretraining and downstream performance improve with model size up to $1.5 \times 10^9$ parameters, without a visible plateau, and that latent representations enable $\log g$ inference with 3–10× greater sample efficiency than a supervised state-of-the-art method. The approach uses a simple GPT-2 framework with MLP-based tokenization and Huber loss for next-token regression, achieving strong scaling even with a modest 0.7B pretraining tokens. These findings suggest that large-scale autoregressive models can serve as robust foundational representations for astronomical time series, offering a scalable path to analyze data from upcoming surveys such as the Rubin Observatory's LSST and SiTian.
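
The next-token regression objective mentioned above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the patch size, the `delta` threshold, and the helper names `patchify` and `next_token_huber` are hypothetical, the MLP that embeds each patch is omitted, and the "model" is a trivial persistence baseline that predicts each patch to equal the previous one.

```python
def huber(residual, delta=1.0):
    """Huber loss for one residual: quadratic near zero, linear in the tails."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)


def patchify(flux, patch_size):
    """Split a light curve into non-overlapping patches (assumed tokenization)."""
    n = len(flux) // patch_size
    return [flux[i * patch_size:(i + 1) * patch_size] for i in range(n)]


def next_token_huber(pred_patches, true_patches, delta=1.0):
    """Mean Huber loss between the prediction at step t and the patch at t+1."""
    total, count = 0.0, 0
    for pred, target in zip(pred_patches[:-1], true_patches[1:]):
        for p, y in zip(pred, target):
            total += huber(p - y, delta)
            count += 1
    return total / count


# Toy light curve with one outlier in the last sample (illustrative values).
flux = [0.0, 0.1, 0.2, 0.1, 0.0, -0.1, -0.2, 5.0]
patches = patchify(flux, patch_size=2)

# Persistence baseline: the "prediction" for the next patch is the current one.
baseline_loss = next_token_huber(patches, patches)
print(baseline_loss)
```

Because the Huber loss grows only linearly for residuals larger than `delta`, the single outlier contributes far less than it would under a squared-error loss, which is the usual motivation for choosing it in regression on noisy data.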

Abstract

Analyzing time series of fluxes from stars, known as stellar light curves, can reveal valuable information about stellar properties. However, most current methods rely on extracting summary statistics, and studies using deep learning have been limited to supervised approaches. In this research, we investigate the scaling law properties that emerge when learning from astronomical time series data using self-supervised techniques. By employing the GPT-2 architecture, we show the learned representation improves as the number of parameters increases from $10^4$ to $10^9$, with no signs of performance plateauing. We demonstrate that a self-supervised Transformer model achieves 3-10 times the sample efficiency compared to the state-of-the-art supervised learning model when inferring the surface gravity of stars as a downstream task. Our research lays the groundwork for analyzing stellar light curves by examining them through large-scale auto-regressive generative models.
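
One way to quantify a trend like the one described in the abstract is to fit a power law to loss versus parameter count. The sketch below assumes the form $L(N) = a N^{-\alpha}$, which is the common choice in the language-model scaling literature but is not spelled out in the text above. The function name `fit_power_law` is hypothetical, and the loss values are synthetic, not results from the paper.

```python
import math


def fit_power_law(n_params, losses):
    """Least-squares fit of log L = log a - alpha * log N; returns (a, alpha)."""
    xs = [math.log(n) for n in n_params]
    ys = [math.log(loss) for loss in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -slope


# Synthetic, illustrative losses generated from a known power law
# (a = 2.0, alpha = 0.1); these are NOT numbers from the paper.
sizes = [1e4, 1e5, 1e6, 1e7, 1e8, 1e9]
synthetic_loss = [2.0 * n ** -0.1 for n in sizes]

a, alpha = fit_power_law(sizes, synthetic_loss)
print(a, alpha)
```

On real training curves, a roughly straight line in log-log space with no flattening at the largest sizes would correspond to the "no signs of performance plateauing" statement.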

Paper Structure (8 sections, 5 figures, 1 table)

This paper contains 8 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Autoregressive one-step prediction from our GPT-2 xl model with 1.5B parameters. The different panels show four representative light curves of stars with varying surface gravity ($\log g$) values. We perform the next-step prediction with the highest likelihood, conditioning on all the previous $N$ steps. Only the part beyond the grey shaded region is predicted. The generative model demonstrates the ability to capture the general trend of the light curves, leading to a robust representation of the light curves. Can improve the plot by showing just one global x-axis and y-axis label, with one legend outside the figure, and labeling the $\log g$ in each panel. Also grey out the region which we never predict. Also no frame for the legend. The font can be larger for all plots.
  • Figure 2: The emergence of the scaling law. We train autoregressive generative models on stellar light curves with different complexities, ranging from $10^4$ to $10^9$ parameters. The different lines show the training loss (MSE of the next-token prediction) as a function of the computational cost for different model sizes. We truncate the training of the models when there is no improvement in the loss after [xx] epochs. The prediction loss plateaus at increasingly more precise values (smaller MSE) for larger models, demonstrating that the scaling law also applies to Transformer-based autoregressive generative models when applied to astronomical time series data. I don't understand why there is a solid color vs. a weaker color. The legend should not include the equation for the dashed lines. In fact, from this plot, I don't see any plateauing for any model: for the smaller models, it appears that if I train long enough, the loss will keep decreasing, so I am not sure I fully follow this plot. You can also just multiply the y-axis by some number and label it as MSE [in 0.001]. Also no frame for the legend. The two largest models are still missing? I think the y-axis will make more physical sense if we take the square root.
  • Figure 3: The latent representations extracted from the GPT-2 (large) models. The native representation is a vector with 768 dimensions, which we subsequently visualize in 2D with the UMAP projection. Different columns show the latent embedding representation extracted at different depths in the generative model. The left panels show the case where only the last token is extracted as the embedding, while the right panels show the weighted average of all 80 tokens at any given layer. Embeddings in the deeper layer show a higher level of abstraction. The points are color-coded by the surface gravity values of the stars.
  • Figure 4: The downstream $\log g$ inference exhibits the scaling law. The solid orange line shows the mean squared error of the inference of $\log g$ derived from the representation of the GPT-2 models through a final MLP head. The MSE is plotted as a function of the number of parameters. We also plot the MSE loss from the next-token prediction, as illustrated in Figure 2, as the blue solid line. The downstream $\log g$ prediction closely traces the next-token prediction. We also compare the $\log g$ prediction with the state-of-the-art Transformer model trained through supervised learning (Astroconformer) and find that the generative approach surpasses the supervised learning SOTA upon reaching $10^7$ parameters. I think the $\log g$ y-axis should be on the left. Instead of prediction MSE, I would call it pretraining MSE, or next-token prediction MSE. Instead of MLP in the legend, should call it GPT-2 or auto-regressive generative model. Instead of Astroconformer, you can write supervised learning SOTA. Why does the plot stop at $10^8$ parameters instead of $10^9$? Did not realize we beat the model even at $10^7$. The introduction and abstract need to change a little to reflect this. I don't think the Astroconformer line requires a symbol; just the line is good. Isn't sqrt of 0.07 = 0.25? This looks not good?
  • Figure 5: What MSE is plotted here, and why is the value so different from the previous figure? If you want to keep this plot, just combine it with the previous figure as a bottom panel. I also would not call this MLP; see the comment above. Instead of training samples, I think it is more reasonable to ask for the number of labeled stars, since individual stars can have a good number of patches.
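
The two ways of turning per-token hidden states into a single embedding described in the Figure 3 caption (last token only versus a weighted average over all tokens) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, the weighting scheme is not specified in the caption so uniform weights are assumed by default, and the toy hidden states use 3 tokens of dimension 2 rather than the 80 tokens and 768 dimensions quoted in the caption.

```python
def last_token_embedding(hidden):
    """Use the hidden state of the final token as the sequence embedding."""
    return list(hidden[-1])


def weighted_average_embedding(hidden, weights=None):
    """Weighted mean of all token hidden states (uniform weights if None)."""
    if weights is None:
        weights = [1.0] * len(hidden)
    total = sum(weights)
    dim = len(hidden[0])
    return [
        sum(w * token[d] for w, token in zip(weights, hidden)) / total
        for d in range(dim)
    ]


# Toy hidden states for one layer: 3 tokens, 2 dimensions (illustrative only).
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]

last = last_token_embedding(hidden)
mean = weighted_average_embedding(hidden)
print(last, mean)
```

Either embedding could then be passed to a small regression head to predict $\log g$, or to a 2D projection such as UMAP for visualization.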