Estimating Probability Densities with Transformer and Denoising Diffusion

Henry W. Leung, Jo Bovy, Joshua S. Speagle

TL;DR

This work tackles the limitation of scalar predictions in scientific regression by introducing an encoder-only Transformer equipped with a denoising diffusion probabilistic model head to estimate conditional probability densities. The model can generate samples and densities conditioned on arbitrary input combinations, enabling non-Gaussian and multimodal outputs. Demonstrations on Galactic stellar data and a California Housing dataset show that the method recovers training densities, produces sensible conditional densities, and even constructs multi-dimensional distributions through sequential conditioning, offering a flexible and scalable density emulator for scientific foundation models. This approach enhances uncertainty quantification and applicability of large-scale foundation models to complex, high-dimensional scientific inference tasks.
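
One way to read "sequential conditioning" is as the chain rule of probability: a joint density over two labels is built by first sampling one label and then feeding that sample back in as an extra condition for the second. The symbols here are illustrative assumptions rather than the paper's notation: $x$ stands for the input data, and $y_1$, $y_2$ for two requested labels.

$$
p(y_1, y_2 \mid x) = p(y_1 \mid x)\, p(y_2 \mid y_1, x)
$$

Under this reading, a one-dimensional density head is enough to draw joint samples: draw $y_1 \sim p(y_1 \mid x)$, then draw $y_2 \sim p(y_2 \mid y_1, x)$.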

Abstract

Transformers are often the go-to architecture for building foundation models that ingest large amounts of training data. However, these models do not estimate a probability density when trained on regression problems, even though full probabilistic outputs are crucial in many fields of science, where the probability distribution of the answer can be non-Gaussian and multimodal. In this work, we demonstrate that training a probabilistic model using a denoising diffusion head on top of the Transformer provides reasonable probability density estimation even for high-dimensional inputs. The combined Transformer+Denoising Diffusion model allows the output probability density to be conditioned on arbitrary combinations of inputs, making it a highly flexible density function emulator of all possible input/output combinations. We illustrate our Transformer+Denoising Diffusion model by training it on a large dataset of astronomical observations and measured labels of stars within our Galaxy, and we apply it to a variety of inference tasks to show that the model can infer labels accurately with reasonable distributions.

Paper Structure

This paper contains 9 sections, 2 equations, and 8 figures.

Figures (8)

  • Figure 1: High-level model architecture of our Transformer+Denoising Diffusion foundation model. The goal of the model is to estimate the probability density of an output based on a list of input scientific data and a request for the unknown data. The role of the Denoising Diffusion Probabilistic Model (DDPM) is to turn the Transformer's hidden state at the first position into a probability density distribution.
  • Figure 2: Probability densities of surface temperature $T_\mathrm{eff}$ (leftmost), surface gravity $\log g$ (middle left), metallicity $[\mathrm{M/H}]$ (middle right) and G-band luminosity (rightmost) of stars with no conditions applied. The blue solid lines show the probability density from our model, while the orange histograms show the probability density from the training set. This figure demonstrates that our model learns the training set distribution of various labels when there is no condition.
  • Figure 3: Probability density of surface gravity $\log g$ given different surface temperature $T_\mathrm{eff}$ and reddening $E(B-V)$ (hence extinction) of stars. All panels show the probability density from the model (solid blue lines) and the training set (orange filled area) based on the condition provided in the panel's title. As reddening increases, a higher proportion of stars are expected to be intrinsically bright giant stars rather than intrinsically dim dwarf stars. This figure demonstrates that our model learns the correct output distribution of various labels when only a few conditions are given, a situation that leads to an ambiguous answer not captured by traditional Transformer-only models.
  • Figure 4: Inference of surface temperature $T_\mathrm{eff}$ from different combinations of input data from the testing set. NN $T_\mathrm{eff}$ is the median of the output probability distribution, and the colors show the output uncertainty, defined as the robust standard deviation of the distribution (i.e., $1.4826 \times$ median absolute deviation). In the left panel, the input contains the whole Gaia XP spectra with colors (113 data points in total, almost double the context size of 64 used during training), and the inferred $T_\mathrm{eff}$ is accurate with reasonable uncertainty. In the middle panel, the input contains the most uninformative part of the Gaia XP spectra, and the model prediction of $T_\mathrm{eff}$ suffers, with large uncertainty, as expected. In the right panel, the same input as in the middle panel is given, but $T_\mathrm{eff}$ is additionally mixed into the inputs; in this case the model predicts an almost perfect $T_\mathrm{eff}$ with very low uncertainty.
  • Figure 5: The distribution of the quantile at which the ground truth is found in the probability density distribution inferred from our model. Similar to the training procedure, we randomly select a subset of data as input and a random label as output for each star. The blue histogram shows the quantile distribution, while the red dotted line represents the uniform distribution, which the quantiles should follow if the model probability distributions were exactly correct. The quantile distribution closely follows the red dotted line with a slight over-concentration around the $50\%$ quantile, indicating that we are slightly overestimating the uncertainty in the outputs.
  • ...and 3 more figures
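
The summary statistics in the Figure 4 caption (median and $1.4826 \times$ median absolute deviation) and the quantile check in the Figure 5 caption can be sketched with NumPy as below. This is a minimal illustration, not the authors' code: the function names `robust_summary` and `truth_quantile` are hypothetical, and synthetic Gaussian draws stand in for samples that would come from the model.

```python
import numpy as np


def robust_summary(samples):
    """Return the median and the robust standard deviation (1.4826 x MAD)."""
    samples = np.asarray(samples, dtype=float)
    median = np.median(samples)
    mad = np.median(np.abs(samples - median))
    return median, 1.4826 * mad


def truth_quantile(samples, truth):
    """Return the fraction of samples at or below the ground-truth value."""
    samples = np.asarray(samples, dtype=float)
    return float(np.mean(samples <= truth))


rng = np.random.default_rng(0)

# ASSUMPTION: synthetic stand-in for model output, 500 "stars" with
# 2000 samples each, drawn from the same Gaussian as the ground truths.
n_stars, n_samples = 500, 2000
samples = rng.normal(loc=5000.0, scale=100.0, size=(n_stars, n_samples))
truths = rng.normal(loc=5000.0, scale=100.0, size=n_stars)

# Point estimate and uncertainty for a single star.
median, sigma = robust_summary(samples[0])

# Quantile of each ground truth within its own sample set. If the sampled
# distributions are well calibrated, these quantiles are roughly uniform.
quantiles = np.array([truth_quantile(s, t) for s, t in zip(samples, truths)])
counts, _ = np.histogram(quantiles, bins=10, range=(0.0, 1.0))
print(f"median={median:.1f}, robust sigma={sigma:.1f}")
print("quantile histogram counts:", counts)
```

The factor 1.4826 makes the median absolute deviation match the standard deviation for a Gaussian, so `sigma` should land close to the true scale of 100 here. A histogram of `quantiles` that piles up near 0.5 would indicate overestimated uncertainties, while pile-ups near 0 and 1 would indicate underestimated ones.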