Shared Stochastic Gaussian Process Latent Variable Models: A Multi-modal Generative Model for Quasar Spectra
Vidhi Lalchand, Anna-Christina Eilers
TL;DR
The paper tackles the challenge of learning from large, noisy, and heterogeneous astronomical data by introducing a scalable shared stochastic GPLVM that jointly models two observation spaces (quasar spectra and derived physical labels) through a common latent representation. By extending the stochastic variational GPLVM to a shared latent space with two independent GP decoders, it handles missing data and supports cross-modal generation and prediction (X → Z → Y). The method is validated on SDSS quasar data (tens of thousands of objects), showing improved reconstruction and predictive performance over baselines, and demonstrates the ability to generate spectra conditioned on synthesized labels and to infer physical properties from spectra alone. This framework opens new avenues for simultaneous spectral analysis, physical inference, and potential standardization of quasars as cosmological probes, while pointing to future work on encoder extensions and kernel design to further enhance applicability and efficiency.
Abstract
This work proposes a scalable probabilistic latent variable model based on Gaussian processes (Lawrence, 2004) in the context of multiple observation spaces. We focus on an application in astrophysics where data sets typically contain both observed spectral features and scientific properties of astrophysical objects such as galaxies or exoplanets. In our application, we study the spectra of very luminous galaxies known as quasars, along with their properties, such as the mass of their central supermassive black hole, accretion rate, and luminosity, resulting in multiple observation spaces. A single data point is then characterized by different classes of observations, each with different likelihoods. Our proposed model extends the baseline stochastic variational Gaussian process latent variable model (GPLVM) introduced by Lalchand et al. (2022) to this setting, proposing a seamless generative model where the quasar spectra and scientific labels can be generated simultaneously using a shared latent space as input to different sets of Gaussian process decoders, one for each observation space. Additionally, this framework enables training in a missing data setting where a large number of dimensions per data point may be unknown or unobserved. We demonstrate high-fidelity reconstructions of the spectra and scientific labels during test-time inference and briefly discuss the scientific interpretations of the results, along with the significance of such a generative model.
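
To make the shared-latent structure concrete, the following is a minimal NumPy sketch of the generative side only: one set of latent points is passed to two independent GP decoders, one producing spectrum-like outputs and one producing label-like outputs, and a mask marks unobserved spectral dimensions. It is not the authors' implementation. The stochastic variational inference machinery (inducing points, mini-batching) is omitted, and the kernel choice, dimensions, noise levels, and all function names (`rbf_kernel`, `sample_gp_outputs`, `masked_gaussian_loglik`) are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(Z1, Z2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel between two sets of latent points (assumed kernel)."""
    sq_dist = (np.sum(Z1**2, axis=1)[:, None]
               + np.sum(Z2**2, axis=1)[None, :]
               - 2.0 * Z1 @ Z2.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def sample_gp_outputs(Z, n_dims, lengthscale, noise_std, rng):
    """Draw n_dims independent GP functions at latent points Z and add Gaussian noise."""
    n = Z.shape[0]
    K = rbf_kernel(Z, Z, lengthscale) + 1e-6 * np.eye(n)  # jitter for numerical stability
    L = np.linalg.cholesky(K)
    F = L @ rng.standard_normal((n, n_dims))              # noise-free function values
    return F + noise_std * rng.standard_normal((n, n_dims))

def masked_gaussian_loglik(X, mean, noise_std, mask):
    """Gaussian log-likelihood summed over observed entries only."""
    resid = np.where(mask, X - mean, 0.0)                 # missing entries never contribute
    per_entry = -0.5 * np.log(2.0 * np.pi * noise_std**2) - 0.5 * (resid / noise_std) ** 2
    return float(np.sum(per_entry[mask]))

rng = np.random.default_rng(0)
N, Q = 50, 2              # number of objects and latent dimensions (illustrative)
D_spec, D_lab = 30, 3     # spectral pixels and scientific labels (illustrative)

# One shared latent point per object feeds both decoders.
Z = rng.standard_normal((N, Q))
X = sample_gp_outputs(Z, D_spec, lengthscale=1.0, noise_std=0.05, rng=rng)  # "spectra"
Y = sample_gp_outputs(Z, D_lab, lengthscale=2.0, noise_std=0.10, rng=rng)   # "labels"

# Missing-data setting: hide a random subset of spectral dimensions per object.
observed = rng.random((N, D_spec)) > 0.3
X_missing = np.where(observed, X, np.nan)

# Only observed pixels enter the likelihood term.
loglik = masked_gaussian_loglik(X_missing, np.zeros_like(X), 0.05, observed)
```

Giving each decoder its own kernel hyperparameters and noise level reflects the idea that each observation space has its own likelihood, while the only coupling between spectra and labels is the shared `Z`.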
