Long Horizon Temperature Scaling

Andy Shih; Dorsa Sadigh; Stefano Ermon

TL;DR

Long Horizon Temperature Scaling (LHTS) addresses a limitation of myopic temperature scaling, which sharpens only the next-token distribution rather than the joint distribution over the full sequence. It introduces an amortized, model-agnostic objective that trains a model $q_T$ to approximate the temperature-scaled joint distribution $p_T(x)$ by minimizing $KL(p_T \| q_T)$, using importance weights that depend on the base model $p$ and the temperature $T$, with variance reduction via baselines and horizon-limited suffixes. The method applies to both diffusion and autoregressive models. A single finetuned model exposes a controllable long-horizon temperature, and it achieves improved likelihood-diversity tradeoffs and downstream task gains, including on an analogy task. The approach supports extrapolation to unseen temperatures and provides practical techniques (clipping, streaming statistics, multi-temperature finetuning) that make sampling from temperature-scaled joint distributions tractable in real-world settings.

Abstract

Temperature scaling is a popular technique for tuning the sharpness of a model distribution. It is used extensively for sampling likely generations and calibrating model uncertainty, and even features as a controllable parameter to many large language models in deployment. However, autoregressive models rely on myopic temperature scaling that greedily optimizes the next token. To address this, we propose Long Horizon Temperature Scaling (LHTS), a novel approach for sampling from temperature-scaled joint distributions. LHTS is compatible with all likelihood-based models, and optimizes for the long horizon likelihood of samples. We derive a temperature-dependent LHTS objective, and show that finetuning a model on a range of temperatures produces a single model capable of generation with a controllable long horizon temperature parameter. We experiment with LHTS on image diffusion models and character/language autoregressive models, demonstrating advantages over myopic temperature scaling in likelihood and sample quality, and showing improvements in accuracy on a multiple choice analogy task by $10\%$.

Paper Structure (35 sections, 2 theorems, 13 equations, 6 figures, 4 tables)

Introduction
Background
Myopic temperature scaling
Pseudo-temperature scaling
Related Work
Long Horizon Temperature Scaling
LHTS on Hierarchical Latent Variable Models
Variance-Reduced LHTS on Autoregressive Models
Computing Suffix Likelihoods
Suffix Horizon Length
Implementation
Clipping
Data Sampling
Multi-Temperature Finetuning
KL Loss
...and 20 more sections

Key Result

Corollary 4.1

$\frac{e^b}{Z_{p_T}} \mathcal{L}(q_T) = KL(p_T \| q_T) + H(p_T)$.
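
The identity in Corollary 4.1 can be checked numerically on a small discrete distribution. The sketch below is illustrative only and rests on assumptions this summary does not state: that $p_T(x) = p(x)^{1/T} / Z_{p_T}$ with $Z_{p_T} = \sum_x p(x)^{1/T}$, and that $\mathcal{L}(q_T)$ is an importance-weighted cross-entropy $-\mathbb{E}_{x \sim p}\left[e^{(1/T - 1)\log p(x) - b} \log q_T(x)\right]$ with a scalar baseline $b$. The function name `check_corollary` is hypothetical.

```python
import math

def check_corollary(p, q, T, b):
    """Return both sides of the Corollary 4.1 identity for discrete p and q.

    Assumed (not given in this summary): the forms of p_T, Z_{p_T}, and L(q_T).
    """
    # Assumed normalizer and temperature-scaled target distribution.
    Z = sum(px ** (1.0 / T) for px in p)
    p_T = [px ** (1.0 / T) / Z for px in p]

    # Assumed objective: expectation under p of the importance weight
    # exp((1/T - 1) * log p(x) - b) times the negative log-likelihood under q.
    loss = -sum(
        px * math.exp((1.0 / T - 1.0) * math.log(px) - b) * math.log(qx)
        for px, qx in zip(p, q)
    )

    kl = sum(a * math.log(a / qx) for a, qx in zip(p_T, q))
    entropy = -sum(a * math.log(a) for a in p_T)

    lhs = math.exp(b) / Z * loss
    rhs = kl + entropy
    return lhs, rhs

lhs, rhs = check_corollary(p=[0.5, 0.3, 0.2], q=[0.4, 0.4, 0.2], T=0.7, b=1.3)
print(abs(lhs - rhs) < 1e-9)
```

Under these assumptions both sides reduce to the cross-entropy $-\sum_x p_T(x) \log q_T(x)$, so the baseline $b$ and the normalizer $Z_{p_T}$ only rescale the loss. Since $H(p_T)$ does not depend on $q_T$, the minimizer is the same as that of $KL(p_T \| q_T)$.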

Figures (6)

Figure 1: Pitfalls of myopic temperature scaling. At the top of the diagram, we depict prompting a language model for a choice of three actions. The language model may respond with each choice with a probability of $0.3$ (shown in green), with the remaining probability of $0.1$ going to irrelevant answers. To reduce the probability of irrelevant answers, we can lower the temperature of the model. In blue, we show that myopic temperature scaling will unintuitively lump together the probabilities of the two actions "tap cabinet" and "tap door", because they share the same first token "tap". Therefore, lowering the myopic temperature will emphasize the probability of these two choices and diminish the probability of choosing "close door". On the other hand, in orange we show that long horizon temperature scaling correctly scales the joint probability of the full sequence, distributing a probability of one-third to each of the three choices.
Figure 2: LHTS Finetuning
Figure 3: Temperature scaling on diffusion models for CIFAR-10. The black dots form the Pareto frontier of pseudo-temperature scaling on DDPM (with pseudo-temperatures $0.99$, $0.985$, and $0.98$), and the orange shows long horizon temperature scaling via finetuning (with long horizon temperatures $0.999$, $0.995$, $0.99$). The x-axis plots log likelihood and y-axis plots negative FID score using $50$k samples. Towards the top right of the chart is better.
Figure 4: Generated image samples from temperature scaled DDPM. Left: pseudo-temperature scaling, with worse FID score $3.94$ and lower sample likelihood $-3.09$. Right: LHTS, with better FID score $3.66$ and higher sample likelihood $-3.07$.
Figure 5: Autoregressive character model with a tunable long horizon temperature parameter. The heatmap shows log-likelihood of samples over various settings of long horizon and myopic temperature. Tuning both temperatures (orange) allows us to increase the likelihood more than just tuning the myopic temperature (blue). More importantly, we achieve a better trade-off between likelihood and diversity. The orange setting gives a higher likelihood with noticeably diverse chunks of text, whereas the blue setting gives lower likelihood yet gives many repetitive generations.
...and 1 more figure
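
The Figure 1 scenario can be reproduced with a toy two-token model. This is a minimal sketch, not the paper's implementation: the token-level factorization (`FIRST`, `SECOND`) is an assumption chosen so that the three actions each have joint probability $0.3$ and the irrelevant answer has $0.1$, and all function names are hypothetical.

```python
# Assumed factorization: "tap cabinet" and "tap door" share the first token.
FIRST = {"tap": 0.6, "close": 0.3, "other": 0.1}
SECOND = {
    "tap": {"cabinet": 0.5, "door": 0.5},
    "close": {"door": 1.0},
    "other": {"<end>": 1.0},
}

def scale(dist, T):
    """Raise each probability to the power 1/T and renormalize."""
    powered = {k: v ** (1.0 / T) for k, v in dist.items()}
    total = sum(powered.values())
    return {k: v / total for k, v in powered.items()}

def joint(first, second):
    """Joint distribution over (first token, second token) pairs."""
    return {
        (a, b): pa * pb
        for a, pa in first.items()
        for b, pb in second[a].items()
    }

def myopic_joint(T):
    """Scale every next-token conditional separately, then multiply."""
    return joint(scale(FIRST, T), {a: scale(d, T) for a, d in SECOND.items()})

def long_horizon_joint(T):
    """Scale the joint probability of each full sequence."""
    return scale(joint(FIRST, SECOND), T)

if __name__ == "__main__":
    for name, fn in [("myopic", myopic_joint), ("long horizon", long_horizon_joint)]:
        probs = fn(0.5)
        print(name, {" ".join(k): round(v, 3) for k, v in probs.items()})
```

At $T = 0.5$, the myopic version gives each "tap" action roughly $0.39$ and "close door" roughly $0.20$, while the long horizon version gives all three actions the same probability of roughly $0.32$. As $T$ decreases further, the long horizon probabilities approach one-third each, matching the caption.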

Theorems & Definitions (4)

Corollary 4.1
proof
Proposition 4.2
proof
