SPT: Spectral Transformer for Red Giant Stars Age and Mass Estimation

Mengmeng Zhang; Fan Wu; Yude Bu; Shanshan Li; Zhenping Yi; Meng Liu; Xiaoming Kong

TL;DR

We address the problem of estimating red giant ages and masses from spectroscopic data while avoiding both the degeneracy of isochrone fitting and the need for long-term asteroseismic observations. Our approach, the Spectral Transformer (SPT), uses a Multi-head Hadamard Self-Attention backbone with a Mahalanobis distance loss and Monte Carlo dropout to predict age and mass directly from spectra. Trained on 3,880 LAMOST DR9 red giant spectra with asteroseismic labels, it achieves $\Delta_P=17.64\%$ for age and $\Delta_P=6.61\%$ for mass, outperforming several baselines. The model provides per-prediction uncertainties, and its results are consistent with asteroseismology and isochrone benchmarks, including open clusters, which supports more robust Galactic archaeology studies. Future work will leverage CSST and LSST datasets to further improve accuracy and applicability.
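To make the per-prediction uncertainties concrete, the following is a minimal sketch of Monte Carlo dropout, not the authors' implementation. It assumes a toy two-layer network with random weights; the function name `mc_dropout_predict`, the dropout rate, and the number of passes are all hypothetical. The idea is to keep dropout active at inference time, run several stochastic forward passes, and report the spread of the outputs as the uncertainty.

```python
import numpy as np

def mc_dropout_predict(x, weights, drop_rate=0.1, n_passes=100, seed=0):
    """Run n_passes stochastic forward passes with dropout left on.

    x: (n_samples, n_features); weights: (W1, b1, W2, b2) of a toy MLP.
    Returns (mean, std) over passes, each of shape (n_samples, n_outputs).
    """
    rng = np.random.default_rng(seed)
    W1, b1, W2, b2 = weights
    keep = 1.0 - drop_rate
    outputs = []
    for _ in range(n_passes):
        h = np.maximum(x @ W1 + b1, 0.0)      # hidden layer with ReLU
        mask = rng.random(h.shape) < keep     # fresh dropout mask on every pass
        h = h * mask / keep                   # inverted-dropout rescaling
        outputs.append(h @ W2 + b2)
    outputs = np.stack(outputs)               # (n_passes, n_samples, n_outputs)
    return outputs.mean(axis=0), outputs.std(axis=0)

# Toy stand-in for a trained model: 8 input features, 2 outputs (age, mass).
rng = np.random.default_rng(1)
weights = (rng.normal(size=(8, 16)), np.zeros(16),
           rng.normal(size=(16, 2)), np.zeros(2))
x = rng.normal(size=(5, 8))
mean, std = mc_dropout_predict(x, weights)    # std is the per-prediction spread
```

Each row of `std` is the uncertainty attached to the corresponding row of `mean`; with `drop_rate=0` every pass is identical and the spread collapses to zero.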

Abstract

The age and mass of red giants are essential for understanding the structure and evolution of the Milky Way. Traditional isochrone methods for these estimations are inherently limited due to overlapping isochrones in the Hertzsprung-Russell diagram, while asteroseismology, though more precise, requires high-precision, long-term observations. In response to these challenges, we developed a novel framework, the Spectral Transformer (SPT), to predict from their spectra the ages and masses of red giants in agreement with asteroseismology. A key component of SPT, the Multi-head Hadamard Self-Attention mechanism, designed specifically for spectra, can capture complex relationships across different wavelengths. Further, we introduced a Mahalanobis distance-based loss function to address scale imbalance and interaction mode loss, and incorporated Monte Carlo dropout for quantitative analysis of prediction uncertainty. Trained and tested on 3,880 red giant spectra from LAMOST, the SPT achieved average percentage errors of 17.64% for age and 6.61% for mass, and provided an uncertainty for each prediction. The results significantly outperform those of traditional machine learning algorithms and demonstrate a high level of consistency with asteroseismology methods and isochrone fitting techniques. In the future, our work will leverage datasets from the Chinese Space Station Telescope and the Large Synoptic Survey Telescope to enhance the precision of the model and broaden its applicability in astronomy and astrophysics.
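As an illustration of why a Mahalanobis distance helps with scale imbalance, here is a minimal sketch under stated assumptions. It uses the standard definition $d_M = \sqrt{(\hat{y}-y)^{\top}\Sigma^{-1}(\hat{y}-y)}$ with $\Sigma$ the covariance of the (age, mass) labels; whether the authors use this exact form, a squared variant, or a different covariance estimate is not stated here. The function name `mahalanobis_loss` and the label values are made up.

```python
import numpy as np

def mahalanobis_loss(y_true, y_pred, cov):
    """Mean Mahalanobis distance between label rows and prediction rows.

    y_true, y_pred: (n, 2) arrays of [age, mass]; cov: (2, 2) label covariance.
    """
    diff = y_pred - y_true
    cov_inv = np.linalg.inv(cov)
    # Quadratic form diff^T cov_inv diff, evaluated for every row.
    d2 = np.einsum("ni,ij,nj->n", diff, cov_inv, diff)
    return np.sqrt(d2).mean()

# Hypothetical labels: age in Gyr, mass in solar masses (illustrative numbers).
y_train = np.array([[2.0, 1.8], [4.0, 1.4], [6.0, 1.2], [9.0, 1.0], [11.0, 0.9]])
cov = np.cov(y_train, rowvar=False)           # captures scale and correlation

y_true = np.array([[5.0, 1.3]])
y_pred = np.array([[6.0, 1.3]])
loss = mahalanobis_loss(y_true, y_pred, cov)
```

Because the covariance absorbs both the units and the age-mass correlation, the loss is unchanged if age is re-expressed in other units, which a plain Euclidean loss would not be.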

Paper Structure (21 sections, 15 equations, 12 figures, 2 tables)

Introduction
Methods
Spectral Transformer (SPT)
Multi-head Hadamard Self-Attention
Mahalanobis Distance Loss Function
Monte Carlo Dropout
Prediction Valuation Metrics
Training Details
Data
Datasets
Data Processing
Results
Age and Mass Estimation
Uncertainty Analysis
Validation
...and 6 more sections

Figures (12)

Figure 1: Overview of the entire framework. The data acquisition section (left panel) describes the data sources for the model: red giant spectra collected by the LAMOST telescope, and the corresponding ages and masses obtained through asteroseismology (primarily determined by the mean large frequency separation, $\Delta\nu$, and the frequency of maximum power, $\nu_{\text{max}}$). These data are the basis for training and testing within the learning framework (right panel). The spectra are input into the model after dimensionality reduction via PCA, while the ages and masses serve as the labels. A linear projection layer transforms the input into the embedded input. The SPT backbone then performs feature extraction; it comprises $L$ SPT blocks (outlined by dashed lines). Each SPT block consists of two BatchNorm (Batch Normalization) layers, a Multi-head HSA layer, an FFN (feedforward network, with two linear layers separated by a GELU activation), and residual connections ($\bigoplus$). The high-level semantic features extracted by the SPT backbone are fed into the MLP head, a fully connected multi-layer neural network, which generates the final output of the model. During forward propagation, the model computes the predicted values and the loss; during backward propagation, it calculates gradients and updates parameters to optimize predictive performance.
Figure 2: Multi-head HSA mechanism. Left panel: HSA mechanism, corresponding to the blue box in the right panel (Multi-head HSA). In this mechanism, $Q$, $K$, and $V$, representing query, key, and value respectively, are obtained by applying different linear projections to the input, and their dimensions are all $d^{*}=64$ in this paper. Softmax is a function that converts a vector of numbers into a vector of probabilities. $\bigodot$ represents the Hadamard product. Scale rescales the calculated attention scores. Enhanced softmax is an improved version of the softmax function that we propose. In the right panel, the Linear layer represents a linear projection, $X$ represents the input, and the different $Q_i$, $K_i$, $V_i$ are obtained through the linear layer, where $i$ is the index of the "head". The "Concat" operation concatenates features from different heads, and the final result is then output through the linear layer.
Figure 3: Preference for loss function. To illustrate the effectiveness of different loss functions, we randomly selected a true label ("True Label") from the dataset and produced two predictions ("Prediction 1" and "Prediction 2"); the three points are shown as red, green, and blue triangles, respectively. The gray dots represent the distribution of data samples. The dashed circle is drawn with a radius equal to the Euclidean distance between "True Label" and "Prediction 1". Left panel: Original data distribution. Middle panel: Data distribution after z-score normalization is applied separately to age and mass. Right panel: Data distribution after normalization using the Mahalanobis distance.
Figure 4: Types of anomalous spectra excluded. The red line shows the trend of flux as a function of wavelength. The shaded regions in each panel show the positions of the anomalies. Top left panel: Oscillations. The spectrum shows the repetitive variation of flux about a central value. Top right panel: Missing values. These spectra lack values in certain wavelength regions. Bottom left panel: Outliers. The spectrum has pronounced peaks or deep troughs. Bottom right panel: Negative values. The spectra display negative values at certain wavelengths.
Figure 5: Pre-and-post data distribution. The solid orange line depicts the original data distribution, while the blue dashed line shows the data distribution after removing outlier spectra. Left panel: Changes in the density distribution of age. Right panel: Changes in the density distribution of mass.
...and 7 more figures
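The Multi-head HSA layer described in the Figure 2 caption can be sketched roughly as follows. The caption does not give the formula, so the order of operations here is an assumption: query and key are combined with a Hadamard (element-wise) product, scaled, passed through a softmax, and used to gate the values. A standard softmax stands in for the authors' enhanced softmax, whose definition is not given here. All function names, the model width, and the number of heads are hypothetical; only the head dimension of 64 comes from the caption.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def hadamard_self_attention(x, Wq, Wk, Wv):
    """One head: element-wise query-key scores instead of a Q @ K.T matrix."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv          # each (n, d_head)
    scores = (q * k) / np.sqrt(q.shape[-1])   # Hadamard product, then scale
    return softmax(scores, axis=-1) * v       # assumed: weights gate the values

def multi_head_hsa(x, heads, Wo):
    """heads: list of (Wq, Wk, Wv) triples; Wo: output projection."""
    out = [hadamard_self_attention(x, *h) for h in heads]
    return np.concatenate(out, axis=-1) @ Wo  # concat heads, then linear layer

rng = np.random.default_rng(0)
d_model, d_head, n_heads = 32, 64, 4          # d_head = 64 follows the caption
x = rng.normal(size=(10, d_model))            # 10 embedded spectra (toy input)
heads = [tuple(rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
         for _ in range(n_heads)]
Wo = rng.normal(size=(n_heads * d_head, d_model)) * 0.1
y = multi_head_hsa(x, heads, Wo)              # same shape as x
```

Under this reading, each spectrum is attended over its own feature dimension rather than over a sequence of tokens, so no n-by-n attention matrix is formed.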
