Data-Driven Stellar Spectral Modelling with GSPICE

Douglas P. Finkbeiner, Joshua S. Speagle, Tanveer Karim

Abstract

Spectral data reduction pipelines face a wide variety of challenges, including masking cosmic rays, calibrating wavelength solutions, and estimating background noise, while trying to remain model-agnostic. Traditional methods rely on hardware-specific code or pre-calculated stellar model templates, making them model-dependent and unsuitable for large datasets that may contain new classes of objects. To address this, we present a flexible, data-driven method, the GausSian PIxelwise Conditional Estimator (GSPICE), which models an ensemble of spectra as a multivariate Gaussian and estimates the expected value and expected variance of each pixel in each spectrum conditional on other pixels. GSPICE compares observed fluxes and errors to its own flux and error estimates to reveal outliers, which can then be masked completely or replaced by their estimates. We apply GSPICE to 3.9 million stellar spectra from the LAMOST survey and show that variations of the method can directly identify and correct both individual pixel-level outliers (e.g., from cosmic ray hits) and extended systematic features (e.g., from incorrect wavelength calibrations), while still providing a novel characterization of the true per-pixel measurement uncertainties. We also demonstrate how GSPICE can take advantage of data partitioning with an application to diffuse interstellar bands. Implementations of GSPICE in both Python and IDL are available at http://github.com/dfink/gspice.
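
For reference, the conditional estimate described in the abstract follows the standard conditioning identities for a multivariate Gaussian. The notation below is an assumption introduced for illustration and may differ from the paper's own symbols: $x_a$ is the pixel being predicted, $x_b$ collects the pixels conditioned on, and $\mu$ and $\Sigma$ are the ensemble mean and covariance, partitioned accordingly.

$$
\mathrm{E}[x_a \mid x_b] = \mu_a + \Sigma_{ab}\,\Sigma_{bb}^{-1}\,(x_b - \mu_b), \qquad
\mathrm{Var}[x_a \mid x_b] = \Sigma_{aa} - \Sigma_{ab}\,\Sigma_{bb}^{-1}\,\Sigma_{ba}.
$$

The conditional variance does not depend on the observed values $x_b$, only on the covariance structure of the ensemble.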

Paper Structure

This paper contains 17 sections, 30 equations, 12 figures, 4 algorithms.

Figures (12)

  • Figure 1: Gaussian conditional estimate of $x_2$ given $x_1$. A joint distribution $P(x_1,x_2)$ (gray ellipse, upper left panel) represents the probability distribution of many correlated realizations of $x_1$ and $x_2$ (red lines, upper right). The probability density of $x_2$ conditional on $x_1=0$ (red line, lower left panel), $P(x_2|x_1=0)$, is represented graphically (lower right) by the distribution of $x_2$ given the fixed value of $x_1=0$. See Section \ref{sec:method} for additional details.
  • Figure 2: Covariance matrix of stellar spectra for $\lambda=3800-9000$ Å (left) and a more limited wavelength range (right), as described in Section \ref{sec:lamost}. The most obvious bright spots in the right panel are H$\delta$ (4100 Å), H$\gamma$ (4340 Å), and H$\beta$ (4860 Å). Bright spots have a dark cross-halo because of the continuum normalization.
  • Figure 3: An illustration showing how Gaussian conditional estimates can handle heterogeneous data. Top row: Distribution of values of two spectral pixels (A and B) across a sample of two heterogeneous populations (green and orange). Their corresponding matrix view (shown in the same color) shows that their means (top two numbers) and covariances (bottom 2x2 numbers) are very different. Second row: A single Gaussian model fit to the heterogeneous population (purple). Third row: The estimates for the value of pixel A conditioning on pixel B (red dashed line) based on the true distribution and the joint Gaussian. Bottom row: Conditional probability of pixel A given the value of pixel B (i.e. probability along the slice) for the two true populations and the joint Gaussian fit. Even if the joint Gaussian is overall a poor fit to the entire population, its conditional estimates nevertheless yield very reasonable answers.
  • Figure 4: Pixels used for Gaussian conditional estimation (green). Pixel $i$ is predicted using all other pixels in the spectrum, except for a guard region around $i$ with length $N_{\rm guard}$. In this work we take $N_{\rm guard}=20$, driven by the size of the smoothing kernel used for continuum normalization.
  • Figure 5: A corner plot showing the estimated distribution of the stellar parameters in LAMOST DR5 that are used to compute the covariance matrix shown in Figure \ref{fig:cov} and for the experiments highlighted in later figures. This highlights that while the underlying stellar population is heterogeneous with a very non-Gaussian distribution, the conditional Gaussian estimation employed by GSPICE still remains valid.
  • ...and 7 more figures
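
The guard-region prediction described in the Figure 4 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the GSPICE implementation: the function name `conditional_pixel_estimate` is hypothetical, the guard region is taken to exclude pixels within `n_guard` of pixel `i` on each side, the covariance is the plain sample covariance of the ensemble, and the data are synthetic.

```python
import numpy as np


def conditional_pixel_estimate(spectra, i, n_guard=20):
    """Predict pixel i of every spectrum from pixels outside a guard window.

    Hypothetical helper for illustration only. `spectra` has shape
    (n_spectra, n_pixels); the guard window is assumed to exclude every
    pixel j with |j - i| <= n_guard (including i itself).
    """
    n_pix = spectra.shape[1]
    mean = spectra.mean(axis=0)
    cov = np.cov(spectra, rowvar=False)  # sample covariance, pixels as variables

    # Reference pixels: everything outside the guard window around i.
    idx = np.arange(n_pix)
    ref = idx[np.abs(idx - i) > n_guard]

    cov_rr = cov[np.ix_(ref, ref)]  # covariance among reference pixels
    cov_ir = cov[i, ref]            # covariance of pixel i with reference pixels

    # Solve cov_rr @ w = cov_ir rather than forming an explicit inverse.
    weights = np.linalg.solve(cov_rr, cov_ir)

    # Standard Gaussian conditional mean (per spectrum) and variance (scalar).
    pred = mean[i] + (spectra[:, ref] - mean[ref]) @ weights
    var = cov[i, i] - cov_ir @ weights
    return pred, var


# Synthetic ensemble: a few latent components plus white noise (assumed data).
rng = np.random.default_rng(0)
n_spec, n_pix = 2000, 60
latent = rng.normal(size=(n_spec, 3))
basis = rng.normal(size=(3, n_pix))
spectra = latent @ basis + 0.1 * rng.normal(size=(n_spec, n_pix))

# Inject one artificial outlier, loosely analogous to a cosmic ray hit.
spectra[0, 30] += 5.0

pred, var = conditional_pixel_estimate(spectra, i=30, n_guard=5)

# One simple outlier statistic (an assumption, not necessarily the paper's):
# residual in units of the predicted standard deviation.
z = (spectra[:, 30] - pred) / np.sqrt(var)
print("most discrepant spectrum:", int(np.argmax(np.abs(z))))
```

Solving the linear system for the weights once per pixel and reusing them across all spectra is a design choice of this sketch; the small `n_guard` here is chosen only to suit the 60-pixel toy spectra.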