Stellar parameter prediction and spectral simulation using machine learning

Vojtěch Cvrček, Martino Romaniello, Radim Šára, Wolfram Freudling, Pascal Ballester

TL;DR

This work addresses the need for fast, accurate extraction of stellar parameters from high-resolution spectra while simultaneously enabling realistic spectral simulations. By combining supervised and semi-supervised autoencoder architectures with a physics-informed spectral simulator, the authors achieve mean Teff errors around 50 K and metallicity/log g precisions near 0.02–0.04 dex, while reducing per-spectrum processing time to about 4 ms on a GPU. The approach uses a semi-supervised latent space that separates label-informed factors from unknown ones, and employs novel generative metrics (RVIS and GIS) to quantify cause-and-effect fidelity in spectral generation. The results show that label-aware models can rival traditional methods in accuracy and scale efficiently to massive surveys, with simulated data providing meaningful benefits when labeled data are sparse, marking a practical path toward high-throughput spectroscopic analyses.

Abstract

We applied machine learning to the entire data history of ESO's High Accuracy Radial Velocity Planet Searcher (HARPS) instrument. Our primary goal was to recover the physical properties of the observed objects, with a secondary emphasis on simulating spectra. We systematically investigated the impact of various factors on the accuracy and fidelity of the results, including the use of simulated data, the effect of varying amounts of real training data, network architectures, and learning paradigms. Our approach integrates supervised and unsupervised learning techniques within autoencoder frameworks. Our methodology leverages an existing simulation model that combines a library of stellar spectra, in which the emerging flux is computed from physical first principles, with a HARPS instrument model to generate simulated spectra comparable to observational data. We trained standard and variational autoencoders on HARPS data to predict spectral parameters and generate spectra. Our models excel at predicting spectral parameters and compressing real spectra, and they achieved a mean prediction error of approximately 50 K for effective temperatures, making them relevant for most astrophysical applications. Furthermore, the models predict metallicity ([M/H]) and surface gravity (log g) with an accuracy of approximately 0.03 dex and 0.04 dex, respectively, underscoring their broad applicability in astrophysical research. The models' computational efficiency, with processing times of 779.6 ms on CPU and 3.97 ms on GPU, makes them valuable for high-throughput applications like massive spectroscopic surveys and large archival studies. By achieving accuracy comparable to classical methods with significantly reduced computation time, our methodology enhances the scope and efficiency of spectroscopic analysis.

Paper Structure

This paper contains 36 sections, 21 equations, 11 figures, 14 tables.

Figures (11)

  • Figure 1: Distribution of effective temperature, surface gravity, and metallicity in the HARPS dataset.
  • Figure 2: Relation between bottleneck size and RVIS. The size of vector $\mathbf{l}$ is fixed at seven, while vector $\mathbf{u}$, which represents the unsupervised portion of the latent representation, varies in size from zero to 13.
  • Figure 3: Kernel density estimation plots illustrating the distribution of absolute error differences. The KDE bandwidth is determined by Scott's rule and is clipped between the first and 99th percentiles. The models are supervised encoders (real and mixed data), supervised AE (bottleneck = 9), supervised infoVAE (bottleneck = 32), and VAE (bottleneck = 128; Nima_2021).
  • Figure 4: Residual plots for effective temperature ($T_\text{eff}$), metallicity ([M/H]), and surface gravity ($\log g$) predictions across different models.
  • Figure 5: Comparison of reconstruction capabilities between CNN and ResNet models using the ETC dataset. The elevated error rate for the CNN model, as shown in Table \ref{tab:simulations}, is attributed to missing absorption lines.
  • ...and 6 more figures
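
The Figure 3 caption mentions a kernel density estimate whose bandwidth follows Scott's rule and which is clipped between the first and 99th percentiles. The following is a minimal sketch of that idea, not the authors' plotting code. It assumes a one-dimensional Gaussian kernel, reads the clipping as restricting the evaluation range to the [1st, 99th] percentile interval of the data, and uses synthetic errors; the function names (`scott_bandwidth`, `clipped_kde`) and the sample values are hypothetical.

```python
import math
import random
import statistics


def scott_bandwidth(samples):
    """Scott's rule for 1-D data: h = sigma * n**(-1/5)."""
    n = len(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return sigma * n ** (-1.0 / 5.0)


def percentile(sorted_samples, q):
    """Linear-interpolation percentile of pre-sorted data, q in [0, 100]."""
    pos = (len(sorted_samples) - 1) * q / 100.0
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo
    return sorted_samples[lo] * (1.0 - frac) + sorted_samples[hi] * frac


def gaussian_kde(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        for x in grid
    ]


def clipped_kde(samples, num_points=200):
    """KDE evaluated only between the 1st and 99th percentiles (assumed reading)."""
    ordered = sorted(samples)
    lo, hi = percentile(ordered, 1), percentile(ordered, 99)
    step = (hi - lo) / (num_points - 1)
    grid = [lo + i * step for i in range(num_points)]
    h = scott_bandwidth(samples)
    return grid, gaussian_kde(samples, grid, h), h


# Hypothetical absolute errors (synthetic values, not taken from the paper).
rng = random.Random(0)
abs_errors = [abs(rng.gauss(0.0, 50.0)) for _ in range(500)]
grid, density, h = clipped_kde(abs_errors)
```

Restricting the grid to the central 98% of the data keeps a few extreme errors from stretching the horizontal axis, which is one plausible reason for the clipping described in the caption.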