SpecCLIP: Aligning and Translating Spectroscopic Measurements for Stars

Xiaosheng Zhao, Yang Huang, Guirong Xue, Xiao Kong, Jifeng Liu, Xiaoyu Tang, Timothy C. Beers, Yuan-Sen Ting, A-Li Luo

TL;DR

SpecCLIP tackles the challenge of heterogeneous stellar spectroscopy by learning cross-instrument representations through a CLIP-inspired framework that aligns LAMOST LRS and Gaia XP spectra. It introduces modality-specific foundation models, spectrum-aware decoders, and a suite of loss terms that preserve spectrum-specific information while enabling cross-modal translation and retrieval. The approach demonstrates strong parameter-estimation performance, robust cross-modal predictions, and useful anomaly-detection signals, with SBI providing richer uncertainty quantification in some tasks. By enabling efficient, few-shot parameter inference and cross-instrument calibration, SpecCLIP offers a scalable path toward unified stellar-parameter catalogs across diverse spectroscopic surveys.

Abstract

In recent years, large language models (LLMs) have transformed natural language understanding through vast datasets and large-scale parameterization. Inspired by this success, we present SpecCLIP, a foundation model framework that extends LLM-inspired methodologies to stellar spectral analysis. Stellar spectra, akin to structured language, encode rich physical and chemical information about stars. By training foundation models on large-scale spectral datasets, our goal is to learn robust and informative embeddings that support diverse downstream applications. As a proof of concept, SpecCLIP involves pre-training on two spectral types--LAMOST low-resolution and Gaia XP--followed by contrastive alignment using the CLIP (Contrastive Language-Image Pre-training) framework, adapted to associate spectra from different instruments. This alignment is complemented by auxiliary decoders that preserve spectrum-specific information and enable translation (prediction) between spectral types, with the former achieved by maximizing mutual information between embeddings and input spectra. The result is a cross-spectrum framework enabling intrinsic calibration and flexible applications across instruments. We demonstrate that fine-tuning these models on moderate-sized labeled datasets improves adaptability to tasks such as stellar-parameter estimation and chemical-abundance determination. SpecCLIP also enhances the accuracy and precision of parameter estimates benchmarked against external survey data. Additionally, its similarity search and cross-spectrum prediction capabilities offer potential for anomaly detection. Our results suggest that contrastively trained foundation models enriched with spectrum-aware decoders can advance precision stellar spectroscopy. Our code SpecCLIP is publicly available at https://github.com/Xiaosheng-Zhao/SpecCLIP
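
The contrastive alignment step described in the abstract can be illustrated with a minimal NumPy sketch of a symmetric CLIP-style (InfoNCE) loss between paired embeddings of the two spectral types. This is not the authors' implementation: the function name `clip_loss`, the temperature of 0.07, the embedding size, and the random toy embeddings are all assumptions chosen for illustration.

```python
import numpy as np


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def clip_loss(z_lrs, z_xp, temperature=0.07):
    """Symmetric contrastive loss for paired embeddings.

    Row i of z_lrs and row i of z_xp are assumed to come from the same star.
    The temperature value is an illustrative assumption, not a value from the paper.
    """
    # L2-normalize so that the dot product is a cosine similarity.
    z_lrs = z_lrs / np.linalg.norm(z_lrs, axis=1, keepdims=True)
    z_xp = z_xp / np.linalg.norm(z_xp, axis=1, keepdims=True)

    # Pairwise similarity matrix: entry (i, j) compares LRS star i with XP star j.
    logits = z_lrs @ z_xp.T / temperature
    idx = np.arange(len(logits))

    # Matching pairs sit on the diagonal; score them in both directions.
    loss_lrs_to_xp = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_xp_to_lrs = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_lrs_to_xp + loss_xp_to_lrs)


# Toy data (hypothetical): 8 stars with 16-dimensional projected embeddings.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 16))
aligned = clip_loss(z, z + 0.01 * rng.normal(size=z.shape))
shuffled = clip_loss(z, rng.normal(size=z.shape))
```

In this toy setup the nearly identical embedding pairs give a much smaller loss than unrelated pairs, which is the behavior the alignment objective rewards.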

Paper Structure

This paper contains 57 sections, 21 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Architecture of SpecCLIP. Two types of spectra (shown here are examples of normalized LAMOST LRS and Gaia XP spectra at varying metallicities) are passed through two pre-trained spectral foundation models to obtain embeddings; the pre-trained models can be either transformer-based networks or multilayer perceptron (MLP)-based autoencoders. These embeddings are then projected into a joint embedding space, which may optionally be split into a shared and a non-shared subspace. Based on the projected embeddings, we construct various loss functions to enable CLIP-like contrastive learning, cross-modal prediction, and spectral reconstruction. The combination of these loss functions results in five model variants: a baseline CLIP without decoders, CLIP-r with only reconstruction decoders, CLIP-p with only prediction decoders, CLIP-pr with full decoders, and CLIP-split with full decoders and an explicit separation of shared and non-shared embedding spaces (see Section \ref{sec:variants}).
  • Figure 2: Comparison between the LAMOST catalog and SpecCLIP models (including the pre-trained LRS model and the LRS branch of the CLIP-split model). From top to bottom: the radial velocity (RV) comparison as a function of the GALAH labels; [Fe/H] comparison as a function of the DESI labels; [Fe/H] comparison as a function of the GALAH labels; and [Fe/H] comparison as a function of the GALAH labels with input spectra shifted to the rest frame using the predicted RVs from the corresponding models in the top row. For RV, which is inferred using the SBI downstream model, the pre-trained LRS model and CLIP-split model have slightly larger scatter but smaller bias than the LAMOST catalog. For [Fe/H], inferred using MLP downstream models (as in all other figures), the pre-trained LRS model and CLIP-split model give either smaller scatter over the metal-poor region (referring to DESI labels) or overall smaller scatter and bias (referring to GALAH labels). The RV-corrected spectra result in similar [Fe/H] prediction performance, suggesting that the trained MLP models are relatively robust to modest Doppler shifts in the LAMOST LRS spectra. The dashed lines are the one-to-one lines. The numbers in the upper left of each panel are the mean offsets and standard deviations of the residuals (y-axis minus x-axis).
  • Figure 3: Comparison of [Fe/H] estimates from SpecCLIP (pre-trained XP model and XP branch of the CLIP-pr model) with reference labels from GALAH (top) and Gaia RVS (bottom). Both models correlate well with reference labels, with the CLIP-pr model yielding lower scatter and bias. The dashed lines are the one-to-one lines. The numbers in the upper left of each panel are the mean offsets and standard deviations of the residuals.
  • Figure 4: Spatial density distributions of extremely metal-poor stars ($-5<{\rm [Fe/H]}<-3$) derived from SpecCLIP (CLIP-pr model) in Galactic coordinates, showing a clear "metal-poor old heart" of our Galaxy.
  • Figure 5: Two examples of in-modal retrieval, cross-modal retrieval, cross-modal prediction, and the LAMOST (Gaia) spectra corresponding to Gaia (LAMOST) in-modal retrieval. The similarity scores are defined in the projected embedding space.
  • ...and 4 more figures
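
The five model variants named in the Figure 1 caption differ only in which decoders and embedding split are switched on. The sketch below encodes that as a small lookup table. It is a hypothetical illustration, not the authors' code: the `Variant` class, the `total_loss` helper, and the equal weighting of loss terms are assumptions, since the relative weights are not given here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Which optional components a SpecCLIP variant uses (per the Figure 1 caption)."""
    reconstruction: bool   # reconstruction decoders
    prediction: bool       # cross-modal prediction decoders
    split_embedding: bool  # explicit shared / non-shared embedding subspaces


VARIANTS = {
    "CLIP": Variant(reconstruction=False, prediction=False, split_embedding=False),
    "CLIP-r": Variant(reconstruction=True, prediction=False, split_embedding=False),
    "CLIP-p": Variant(reconstruction=False, prediction=True, split_embedding=False),
    "CLIP-pr": Variant(reconstruction=True, prediction=True, split_embedding=False),
    "CLIP-split": Variant(reconstruction=True, prediction=True, split_embedding=True),
}


def total_loss(name, contrastive, reconstruction, prediction):
    """Sum the loss terms that are active for a variant.

    Equal weights are an assumption made for this sketch only.
    """
    variant = VARIANTS[name]
    loss = contrastive  # every variant keeps the contrastive term
    if variant.reconstruction:
        loss += reconstruction
    if variant.prediction:
        loss += prediction
    return loss


# Example with made-up loss values for one training batch.
losses = {name: total_loss(name, 1.0, 2.0, 4.0) for name in VARIANTS}
```

Note that CLIP-pr and CLIP-split combine the same loss terms in this sketch; they differ in how the embedding space is organized, which the flag `split_embedding` records but the toy loss does not model.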