
The Well-Tempered Classifier: Some Elementary Properties of Temperature Scaling

Pierre-Alexandre Mattei, Bruno Loureiro

TL;DR

Increasing the temperature increases the model's uncertainty in a very general sense (in particular, it increases its entropy), and temperature scaling is the only linear scaler that does not change the model's hard predictions.
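As a rough numerical illustration of these two claims (not a proof, and not code from the paper), the sketch below tempers a hypothetical logit vector at several temperatures. The logits, the temperature grid, and the helper names `temper` and `entropy` are assumptions made for this example only.

```python
import numpy as np

def temper(logits, temperature):
    """Softmax of logits / temperature, for temperature > 0."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def entropy(p):
    """Shannon entropy in nats, with 0 * log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

# Hypothetical classifier output for a single input (assumption).
logits = np.array([2.0, 0.5, -1.0, 0.0])
temperatures = [0.25, 0.5, 1.0, 2.0, 4.0]

entropies = [entropy(temper(logits, t)) for t in temperatures]
predictions = [int(np.argmax(temper(logits, t))) for t in temperatures]

for t, h, k in zip(temperatures, entropies, predictions):
    print(f"T={t:<5} entropy={h:.4f} argmax={k}")
```

On this input the entropy grows with the temperature while the predicted class stays the same, since dividing the logits by a positive constant preserves their ordering.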

Abstract

Temperature scaling is a simple method that makes it possible to control the uncertainty of probabilistic models. It is mostly used in two contexts: improving the calibration of classifiers and tuning the stochasticity of large language models (LLMs). In both cases, temperature scaling is the most popular method for the job. Despite its popularity, a rigorous theoretical analysis of the properties of temperature scaling has remained elusive. We investigate here some of these properties. For classification, we show that increasing the temperature increases the uncertainty in the model in a very general sense (and in particular increases its entropy). However, for LLMs, we challenge the common claim that increasing temperature increases diversity. Furthermore, we introduce two new characterisations of temperature scaling. The first one is geometric: the tempered model is shown to be the information projection of the original model onto the set of models with a given entropy. The second characterisation clarifies the role of temperature scaling as a submodel of more general linear scalers such as matrix scaling and Dirichlet calibration: we show that temperature scaling is the only linear scaler that does not change the hard predictions of the model.
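The claim about LLMs can be made concrete with a small two-token toy model. The sketch below is an illustration under stated assumptions, not the toy model used in the paper: it assumes that the "myopic" scaling named in the caption of Figure 1 means tempering each next-token distribution separately, and the vocabulary size `K`, the probabilities, and all function names are hypothetical.

```python
import numpy as np

def temper_probs(p, beta):
    """Distribution proportional to p**beta, with beta = 1 / temperature."""
    w = np.asarray(p, dtype=float) ** beta
    return w / w.sum()

def entropy(p):
    """Shannon entropy in nats, with 0 * log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

K = 100                                  # hypothetical second-token vocabulary size
first = np.array([0.1, 0.9])             # p(x1 = 0), p(x1 = 1)
second = np.zeros((2, K))
second[0, 0] = 1.0                       # after x1 = 0 the next token is certain
second[1, :] = 1.0 / K                   # after x1 = 1 it is uniform over K tokens

def myopic_joint(beta):
    """Temper every next-token distribution on its own, then multiply."""
    p1 = temper_probs(first, beta)
    cond = np.stack([temper_probs(row, beta) for row in second])
    return (p1[:, None] * cond).ravel()

def global_joint(beta):
    """Temper the joint distribution over whole two-token sequences."""
    joint = (first[:, None] * second).ravel()
    return temper_probs(joint, beta)

temperatures = [1.0, 2.0, 4.0]
myopic_entropies = [entropy(myopic_joint(1.0 / t)) for t in temperatures]
global_entropies = [entropy(global_joint(1.0 / t)) for t in temperatures]

for t, hm, hg in zip(temperatures, myopic_entropies, global_entropies):
    print(f"T={t}: myopic entropy={hm:.3f}, sequence-level entropy={hg:.3f}")
```

In this particular toy model, raising the temperature moves first-token mass away from the high-entropy branch, so the entropy of the token-wise tempered model falls over this temperature range while the entropy of the sequence-level tempered model rises.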

Paper Structure (29 sections, 9 theorems, 51 equations, 3 figures)


Key Result

Lemma 1

We have

Figures (3)

  • Figure 1: Entropies of two tempered versions of a toy language model. Standard temperature scaling has the expected monotonic effect on entropy, but is impractical for LLMs. The "myopic" version, widely used in practice, has a much less intuitive behaviour.
  • Figure 2: Temperature scaling as an information projection. In both cases, the initial model $p$ is projected onto the set $\mathcal{Q}$ of distributions of constant entropy $h^\star=0.9$. The blue arrow is the path of all tempered models between $p$ and its projection $p_{\beta^\star}$. This path can be interpreted as the geodesic between $p$ and $p_{\beta^\star}$. (Left) The initial distribution $p=(0.01, 0.09, 0.9)$ has a lower entropy than $h^\star$, and is warmed up towards $p_{\beta^\star}$ with $\beta^\star \approx 0.37$. (Right) The initial distribution $p=(0.4, 0.35, 0.25)$ has a higher entropy than $h^\star$, and is cooled down towards $p_{\beta^\star}$ with $\beta^\star \approx 3.98$.
  • Figure 3: Terms involved in the computation of the entropy of the myopically scaled model. (Left) The marginal and conditional entropies of Equation \ref{eq:app_chainrule1}. (Right) The two terms of the decomposition of Equation \ref{eq:app_chainrule2}. The source of nonmonotonicity is the term $\pi_\beta H(p_{\beta}(x_2 \mid x_1 = 1))$, which is the product of an increasing function and a decreasing function.
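
The values of $\beta^\star$ quoted in the caption of Figure 2 can be approximately recovered by solving $H(p_\beta) = h^\star$ numerically. The sketch below is a minimal illustration, assuming that the tempered model is $p_\beta \propto p^\beta$ and that entropy is measured in nats; the bisection routine and its names (`solve_beta`, `lo`, `hi`, `iters`) are hypothetical and not taken from the paper. It only matches the entropy constraint and does not by itself verify the information-projection property.

```python
import numpy as np

def temper_probs(p, beta):
    """Distribution proportional to p**beta, computed in the log domain."""
    logw = beta * np.log(np.asarray(p, dtype=float))
    logw = logw - logw.max()
    w = np.exp(logw)
    return w / w.sum()

def entropy(p):
    """Shannon entropy in nats, with 0 * log 0 taken as 0."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

def solve_beta(p, target_entropy, lo=1e-6, hi=50.0, iters=200):
    """Bisection on beta, using that the entropy of p**beta decreases in beta."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if entropy(temper_probs(p, mid)) > target_entropy:
            lo = mid      # entropy still too high: increase beta (cool down)
        else:
            hi = mid      # entropy too low: decrease beta (warm up)
    return 0.5 * (lo + hi)

h_star = 0.9
beta_left = solve_beta([0.01, 0.09, 0.9], h_star)     # caption reports about 0.37
beta_right = solve_beta([0.4, 0.35, 0.25], h_star)    # caption reports about 3.98

print(f"left panel:  beta* = {beta_left:.3f}")
print(f"right panel: beta* = {beta_right:.3f}")
```

A value of $\beta^\star$ below 1 corresponds to warming the model up and a value above 1 to cooling it down, consistent with the two panels described in the caption.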

Theorems & Definitions (12)

  • Lemma 1
  • Proposition 1
  • Proposition 2
  • Definition 1
  • Theorem 3.1
  • Theorem 4.1
  • Theorem 5.1
  • Corollary 1
  • Theorem F.1
  • proof
  • ...and 2 more