Shared Stochastic Gaussian Process Latent Variable Models: A Multi-modal Generative Model for Quasar Spectra

Vidhi Lalchand, Anna-Christina Eilers

TL;DR

The paper addresses learning from large, noisy, and heterogeneous astronomical data by introducing a scalable shared stochastic GPLVM that jointly models two observation spaces, quasar spectra and derived physical labels, through a common latent representation. By extending the stochastic variational GPLVM to a shared latent space with two independent GP decoders, it handles missing data and supports cross-modal generation and prediction (X → Z → Y). The method is validated on SDSS quasar data (tens of thousands of objects), showing improved reconstruction and predictive performance over baselines, and it demonstrates both the generation of spectra conditioned on synthesized labels and the inference of physical properties from spectra alone. The framework opens new avenues for simultaneous spectral analysis, physical inference, and potential standardization of quasars as cosmological probes, and it points to future work on encoder extensions and kernel design to improve applicability and efficiency.

Abstract

This work proposes a scalable probabilistic latent variable model based on Gaussian processes (Lawrence, 2004) in the context of multiple observation spaces. We focus on an application in astrophysics where data sets typically contain both observed spectral features and scientific properties of astrophysical objects such as galaxies or exoplanets. In our application, we study the spectra of very luminous galaxies known as quasars, along with their properties, such as the mass of their central supermassive black hole, accretion rate, and luminosity, resulting in multiple observation spaces. A single data point is then characterized by different classes of observations, each with different likelihoods. Our proposed model extends the baseline stochastic variational Gaussian process latent variable model (GPLVM) introduced by Lalchand et al. (2022) to this setting, proposing a seamless generative model where the quasar spectra and scientific labels can be generated simultaneously using a shared latent space as input to different sets of Gaussian process decoders, one for each observation space. Additionally, this framework enables training in a missing data setting where a large number of dimensions per data point may be unknown or unobserved. We demonstrate high-fidelity reconstructions of the spectra and scientific labels during test-time inference and briefly discuss the scientific interpretations of the results, along with the significance of such a generative model.
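
The generative structure described in the abstract, one shared latent space feeding a separate set of GP decoders per observation space, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the squared-exponential kernel, the standard normal prior on the latents, the dimensions, the hyperparameter values, and every function name below are assumptions chosen for illustration, and the sketch only draws samples from the prior rather than performing stochastic variational inference.

```python
import numpy as np


def rbf_kernel(Z, lengthscale, variance=1.0):
    """Squared-exponential kernel matrix on latent points Z of shape (N, Q)."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dists / lengthscale ** 2)


def sample_gp_outputs(Z, n_outputs, lengthscale, noise_std, rng):
    """Draw n_outputs independent GP functions at Z and add Gaussian noise."""
    n_points = Z.shape[0]
    # Small jitter on the diagonal keeps the Cholesky factorisation stable.
    K = rbf_kernel(Z, lengthscale) + 1e-6 * np.eye(n_points)
    chol = np.linalg.cholesky(K)
    # Each column is one GP draw, i.e. one output dimension.
    F = chol @ rng.standard_normal((n_points, n_outputs))
    return F + noise_std * rng.standard_normal((n_points, n_outputs))


def sample_shared_gplvm(n_points=50, latent_dim=2, dim_x=30, dim_y=3, seed=0):
    """Sample (Z, X, Y): both observation spaces are decoded from the same Z."""
    rng = np.random.default_rng(seed)
    # Shared latent points (assumed standard normal prior).
    Z = rng.standard_normal((n_points, latent_dim))
    # Two decoder stacks with their own (hypothetical) hyperparameters.
    X = sample_gp_outputs(Z, dim_x, lengthscale=1.0, noise_std=0.05, rng=rng)
    Y = sample_gp_outputs(Z, dim_y, lengthscale=2.0, noise_std=0.10, rng=rng)
    return Z, X, Y


Z, X, Y = sample_shared_gplvm()
print(Z.shape, X.shape, Y.shape)  # (50, 2) (50, 30) (50, 3)
```

Because `X` and `Y` depend on each other only through `Z`, inferring the latent point for a new spectrum is enough to predict its labels, which is the X → Z → Y route mentioned in the TL;DR.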

Paper Structure

This paper contains 26 sections, 20 equations, 11 figures, 4 tables, 2 algorithms.

Figures (11)

  • Figure 1: The graphical model of the shared GPLVM with two sets of independent GPs and their respective hyperparameter sets.
  • Figure 2: Shared GPLVM with multiple observation spaces. The blocks on the right-hand side denote the two observation spaces $(X,Y)$ of quasar spectra and scientific labels, respectively. In the center are two stacks of GPs, one for each observation space, which control the data generation process through the shared latent space. In the figure we assume $Q=2$ for ease of visualisation, since we depict the GPs as two-dimensional surfaces; typically, however, $Q$ can be higher than 2, corresponding to higher-dimensional GPs.
  • Figure 3: Reconstruction plots of test quasar spectra with $\pm 1.96\sigma$ intervals. The blue curve denotes the posterior predictive mean at each dimension.
  • Figure 4: Reconstruction of a single spectrum from the latent representation informed by a partially observed spectrum. The shaded orange regions denote the "observed" wavelength regions for this experiment. Note that in the 4th panel the $95\%$ prediction intervals are wider because the spectrum was observed over a shorter and less informative wavelength window.
  • Figure 5: Scientific label prediction for the quasars' bolometric luminosity (left), black hole mass (middle) and Eddington ratio (right), colored by the SNR of their spectra and based on test $X^*$ only. The dashed black line denotes the 1-to-1 line to aid visualisation of reconstruction accuracy. The vertical and horizontal error bars denote the posterior predictive standard deviation.
  • ...and 6 more figures
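
The $\pm 1.96\sigma$ bands of Figure 3 and the $95\%$ prediction intervals of Figure 4 describe the same quantity if the posterior predictive at each output dimension is treated as Gaussian. The short sketch below checks that correspondence and builds such an interval. The Gaussian assumption, the function names, and the example numbers are illustrative and not taken from the paper.

```python
import math


def gaussian_central_coverage(k):
    """Probability that a Gaussian variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))


def prediction_interval(mean, std, k=1.96):
    """Symmetric interval mean +/- k * std around a predictive mean."""
    return mean - k * std, mean + k * std


coverage = gaussian_central_coverage(1.96)
# Hypothetical predictive mean and standard deviation for one output dimension.
lower, upper = prediction_interval(mean=1.0, std=0.5)
print(f"coverage = {coverage:.4f}, interval = ({lower:.2f}, {upper:.2f})")
```

A wider band at fixed $k$ therefore reflects a larger predictive standard deviation, as in the 4th panel of Figure 4.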