
Understanding temperature tuning in energy-based models

Peter W Fields, Vudtiwat Ngampruetikorn, David J Schwab, Stephanie E Palmer

TL;DR

This paper addresses why post-hoc temperature adjustments improve generative outputs in sparse data regimes. The authors develop a physically motivated framework that uses the forward and reversed KL divergences to quantify the fidelity-diversity trade-off and to define an optimal sampling temperature. Using a simple toy model and a structured Ising landscape, they show that the optimal temperature depends on both the data and the energy landscape, and that improving generative performance may require either raising or lowering $\tau$. The work offers a diagnostic perspective for evaluating learned distributions and guiding robust training strategies in biological sequence design and related high-dimensional systems.

Abstract

Generative models of complex systems often require post-hoc parameter adjustments to produce useful outputs. For example, energy-based models for protein design are sampled at an artificially low "temperature" to generate novel, functional sequences. This temperature tuning is a common yet poorly understood heuristic used across machine learning contexts to control the trade-off between generative fidelity and diversity. Here, we develop an interpretable, physically motivated framework to explain this phenomenon. We demonstrate that in systems with a large "energy gap" - separating a small fraction of meaningful states from a vast space of unrealistic states - learning from sparse data causes models to systematically overestimate high-energy state probabilities, a bias that lowering the sampling temperature corrects. More generally, we characterize how the optimal sampling temperature depends on the interplay between data size and the system's underlying energy landscape. Crucially, our results show that lowering the sampling temperature is not always desirable; we identify the conditions where raising it results in better generative performance. Our framework thus casts post-hoc temperature tuning as a diagnostic tool that reveals properties of the true data distribution and the limits of the learned model.
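As a minimal sketch of what sampling at a temperature means, the snippet below rescales the energies of a small discrete model by $\tau$, giving $q_\tau(x) \propto \exp(-E(x)/\tau)$, and reports how much probability mass sits on the low-energy states. The function name `tempered` and the ten-state landscape are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def tempered(energies, tau):
    """Return q_tau(x) proportional to exp(-E(x) / tau) on a finite state space."""
    logits = -np.asarray(energies, dtype=float) / tau
    logits -= logits.max()  # subtract the max so the exponentials cannot overflow
    weights = np.exp(logits)
    return weights / weights.sum()

# Hypothetical landscape: 2 low-energy states and 8 high-energy states, gap of 2.
energies = np.array([0.0] * 2 + [2.0] * 8)

for tau in (0.5, 1.0, 2.0):
    q = tempered(energies, tau)
    print(f"tau={tau}: mass on low-energy states = {q[:2].sum():.3f}")
```

Lowering $\tau$ concentrates mass on the low-energy states and raising it spreads mass toward the high-energy ones, which is the knob the abstract refers to.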

Paper Structure

This paper contains 19 sections, 41 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: (a) Schematic of training, modifying, and sampling generative models. Each box represents the entirety of state space; each dot is a data point in that space. (a-Left) Training data (black dots) are generated by a ground truth distribution, $p$ (solid line). A model, $\hat{q}$ (dotted lines), is fit to these data. (a-Middle) Samples are generated from the model. Samples may be drawn either from areas of high probability in the ground truth (blue dots) or from areas of low probability in the ground truth (red solid dots: false positives). Note that, depending on the ground truth distribution, the fit model may also fail to generate relevant samples (red empty dots: false negatives). (a-Right) Modifying the sampling technique gives a larger proportion of relevant samples. At top, this comes at the expense of missing some areas of the ground truth distribution, increasing false negatives. At bottom, this increases the sampling of false positives. (b) The trade-off between sampling states from the fit model that are probable (i.e. less surprising or more believable) according to the true distribution and the diversity of those states. It is captured by the trade-off between entropy, $H[\hat{q}_\tau]$, and cross-entropy, $H[\hat{q}_\tau, p]$, where the curve is parameterized by the sampling temperature $\tau$ of the trained distribution $\hat{q}$. The difference $H[\hat{q}_\tau, p] - H[\hat{q}_\tau]$ is $D_{\mathrm{KL}}(\hat{q}_\tau||p)$, whose minimizer over $\tau$ is denoted by $\tau^{*}$. Note that $\tau^{*}$ need not equal 1, which would denote sampling the model "as is." (Inset) The optimal trade-off appears as a peak in the ratio of $H[\hat{q}_\tau]$ to $H[\hat{q}_\tau, p]$.
  • Figure 2: Simple toy example of when raising versus lowering temperature improves generative performance. (a) The ground truth consists of a vector, $\mathbf{L}$, that assigns each of 10 states to a low- or high-energy level. Sampling this distribution leads to an empirical distribution over states, $\mathbf{p}_{\text{data}}$, used to get maximum likelihood estimates of the energy level assignment vector and energy gap, $\hat{\mathbf{L}}$ and $\hat{\Delta}$. (b) True model in yellow with $\Delta=4$; the data distribution, made from 10 samples, is represented by dotted lines. Fitting to an under-sampled distribution causes the maximum likelihood estimates to over-estimate the probability mass on high-energy states and under-estimate the mass on the "missed" low-energy state. (c) Decomposition of the forward (red) and reversed (blue) $D_{\mathrm{KL}}$'s between the fit and true distributions. Note the main contributions to each $D_{\mathrm{KL}}$: the missed low-energy states for the forward and the high-energy states for the reversed. (d) Rescaling the inferred energy gap by $\tau$, while keeping $\hat{\mathbf{L}}$ fixed, affects the forward and reversed $D_{\mathrm{KL}}$'s. Note that the temperature must be changed in opposite directions to reach the minimum of each. The minimum of the reversed (blue) $D_{\mathrm{KL}}$ corresponds to $\tau^{*}$ in Fig. 1. (e-f) $D_{\mathrm{KL}}$'s decomposed into contributions from missed low-energy states, found low-energy states, and high-energy states. Note that raising the temperature, $\tau$, mitigates the contribution from the missed low-energy state to the forward $D_{\mathrm{KL}}$ in (f), and lowering $\tau$ mitigates the contribution from the high-energy states to the reversed $D_{\mathrm{KL}}$ in (e). (g-h) Scaled color images of the mean optimal $\tau^*$ and $\tau'$, calculated from 50 replicate experiments on a ground truth with $n_l=20$ and $n_h=80$, for several values of $M$ and $\Delta$. See Appendix \ref{app:tau_calcs} for details regarding the calculation of $\tau^*$ and $\tau'$.
  • Figure 3: Experiments on a $4 \times 4$ nearest neighbor Ising model. (a) Starting at left and going clockwise, the ground truth model is defined at a given temperature $T$ by Eq. (\ref{eq:toy}). $M$ samples are taken to form the training set, $\mathcal{D}_T$. These data are fit via minimization of Eq. (\ref{eq:like}) to give the parameters $\hat{\mathbf{J}}$. The generative properties are measured via $D_{\mathrm{KL}}(p_T||\hat{q}_\tau)$ and $D_{\mathrm{KL}}(\hat{q}_\tau||p_T)$ (see text). (b-d) Breakdown of contributions to each $D_{\mathrm{KL}}$ by energy level, Eqs. (\ref{eq:level_set})-(\ref{eq:dkl_breakdown}). (b) Density of states for the first 9 excited energy levels of a $4 \times 4$ nearest neighbor Ising model. (c-g) Results from one experiment where $M=93$ and $T=2.3$. (c) The average amount of probability per state within each energy level, as described by Eqs. (\ref{eq:q_i}) and (\ref{eq:p_i}). The amount of probability per state is underestimated by $\hat{q}$ for lower excited states and overestimated for higher excited states. (d) The contribution to each $D_{\mathrm{KL}}$ per energy level. The lower excited states are more deleterious to the forward $D_{\mathrm{KL}}$ (red) and the higher excited states are more deleterious to the reversed $D_{\mathrm{KL}}$ (blue). (e-g) Raising versus lowering sampling temperature $\tau$ and its dependence on contributions to each $D_{\mathrm{KL}}$ from states at different energy levels. (e) $D_{\mathrm{KL}}$'s as a function of sampling temperature $\tau$. Note the minima of each are located on opposite sides of $\tau=1$. (f) The reversed $D_{\mathrm{KL}}$ per energy level. Lowering $\tau$ to $\tau^*$ mainly mitigates contributions from higher excited states. (g) The forward $D_{\mathrm{KL}}$ broken down by contributions from different energy levels. Raising $\tau$ to $\tau'$ decreases the major contribution from excited states 1-3. (h-i) Ten replicates of an experiment at each $T$ and $M$ are conducted and the corresponding optimal $\tau^*$ and $\tau'$ are found for each. The scaled color image depicts the average over replicates.
  • Figure A1: $\kappa$ and $C$, Eqs. (\ref{eq:crosscap})-(\ref{eq:cap}), determine whether to raise or lower $\tau$ in order to improve generative performance. (a-f) Experiments on the illustrative toy model for 5 low-energy states and 10 high-energy states, with models fit to 10 training samples. (a-c) One experiment for ground truth $\Delta=2$. Scarce training data causes erroneous assignment of excited states as ground states in the model (a); weak correlation of the model with the true distribution, $\kappa < C$, makes it advantageous to raise $\tau$, (b) and (c). (d-f) One experiment for $\Delta = 7$. Strong correlation of the model with the true distribution, $\kappa > C$ in (a), makes it advantageous to lower $\tau$, (b) and (c). In (g), each point represents an average of 200 replicates of experiments done for several values of $\Delta$, with the training set fixed at 80 samples. The illustrative toy model contains 20 low-energy states and 80 high-energy states, as in Fig. 2(g) and (h).
  • Figure A2: Per energy-level breakdown of the reversed $D_{\text{KL}}$ and its derivative with respect to $\tau$ reveals what dictates the need to raise or lower $\tau$. (a-b) Results of an experiment done on the $4 \times 4$ nearest-neighbor Ising distribution at $T=2.3$ with $\hat{q}$ trained on $M=54$ samples. (a) Contributions to the reversed $D_{\text{KL}}$ for the first 9 excited energy levels. (b) Contributions to its derivative w.r.t. $\tau$, evaluated at $\tau=1$, are positive, indicative of a need to lower $\tau$. (c-d) Results of an experiment done at $T=4$ for $M=54$. Strong contributions to the reversed $D_{\text{KL}}$ also come from excited states (c); however, negative contributions dominate the derivative (d), revealing a need to raise $\tau$. (e) Many experiments done on the $4\times4$ Ising distribution at various ground truth $T$. Each point represents an average over 10 experiments done with 54 training samples each. For low $T$, the need to lower $\tau$ arises from a strong correlation with the true energy function relative to the model's energy variance: $\kappa > C$, corresponding to a positive value of $\frac{\partial}{\partial\tau}D_{\text{KL}}(\hat{q}_\tau||p_T)\Big|_{\tau=1}$. For high $T$, the intra-model variance dominates, and probability mass can spread out over state space faster than it comes off of true low-energy states, i.e. $\kappa < C$ and $\tau$ should be raised.
  • ...and 2 more figures
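The captions for Figures 1 and 2 describe scanning the sampling temperature and locating the minima of the reversed and forward $D_{\mathrm{KL}}$. The following is a minimal sketch of that scan on a two-level toy landscape. It is not the authors' code: the helper names, the choice of which low-energy state the model misses, and the fitted gap of 3.0 are hypothetical values chosen for illustration, where the paper instead obtains $\hat{\mathbf{L}}$ and $\hat{\Delta}$ by maximum likelihood from sampled data.

```python
import numpy as np

def two_level(low_mask, gap):
    """Two-level distribution: energy 0 where low_mask is True, `gap` elsewhere."""
    energies = np.where(low_mask, 0.0, gap)
    weights = np.exp(-energies)
    return weights / weights.sum()

def entropy(q):
    return -np.sum(q * np.log(q))

def cross_entropy(q, p):
    return -np.sum(q * np.log(p))

def kl(a, b):
    return np.sum(a * np.log(a / b))

n_states = 10
true_low = np.arange(n_states) < 4    # assumed: states 0-3 are truly low-energy
model_low = np.arange(n_states) < 3   # assumed: the fitted model "missed" state 3
true_gap, model_gap = 4.0, 3.0        # model_gap is a hypothetical fitted value

p = two_level(true_low, true_gap)
taus = np.linspace(0.3, 3.0, 271)     # grid with spacing 0.01

# Rescale the inferred gap by tau while keeping the level assignment fixed.
models = [two_level(model_low, model_gap / t) for t in taus]
reverse = np.array([kl(q, p) for q in models])   # D_KL(q_tau || p)
forward = np.array([kl(p, q) for q in models])   # D_KL(p || q_tau)

tau_star = taus[reverse.argmin()]    # minimizer of the reversed KL
tau_prime = taus[forward.argmin()]   # minimizer of the forward KL
print(f"tau* = {tau_star:.2f}, tau' = {tau_prime:.2f}")

# The reversed KL equals cross-entropy minus entropy, as in Figure 1(b).
q_unit = two_level(model_low, model_gap)
gap_check = cross_entropy(q_unit, p) - entropy(q_unit)
```

In this particular configuration the reversed divergence is minimized below $\tau=1$ and the forward divergence above it, mirroring the opposite-direction behaviour described for Figure 2(d). Other choices of the assumed gap can move both minima to the same side of 1.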