Optimizing Temperature for Language Models with Multi-Sample Inference

Weihua Du, Yiming Yang, Sean Welleck

TL;DR

This work addresses how to set the temperature for multi-sample inference in large language models without task-specific validation data. It introduces the Entropy Turning Point (EntP) and TURN, an entropy-based method that selects near-optimal temperatures by analyzing token-level entropy across temperatures and applying an aggregation-aware adjustment. A stochastic process model supports the interpretation that the entropy spike near the turning point signals a collapse in generation quality, and token-level entropy serves as a proxy for the distance between a model's training data and the target task. Across math problem solving and code generation tasks, TURN consistently matches or exceeds fixed-temperature baselines, improves efficiency, and provides interpretable guidance on how training-task similarity shapes temperature needs.
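To make "token-level entropy across temperatures" concrete, here is a minimal sketch of how temperature rescales a next-token distribution and how its entropy can be measured. The function names and the logit values are hypothetical and chosen for illustration; this is not the authors' code.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the sampling temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def token_entropy(logits, temperature):
    """Shannon entropy (in nats) of the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# Hypothetical logits for a single decoding step (illustration only).
example_logits = [4.0, 2.0, 0.0, -1.0]
entropy_by_temperature = {
    t: token_entropy(example_logits, t) for t in (0.2, 0.6, 1.0, 1.4)
}
```

For a fixed set of logits the entropy grows with temperature and stays below the log of the vocabulary size. In practice the quantity of interest would be this entropy averaged over the tokens of sampled generations, which the sketch does not cover.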

Abstract

Multi-sample aggregation strategies, such as majority voting and best-of-N sampling, are widely used in contemporary large language models (LLMs) to enhance predictive accuracy across various tasks. A key challenge in this process is temperature selection, which significantly impacts model performance. Existing approaches either rely on a fixed default temperature or require labeled validation data for tuning, which are often scarce and difficult to obtain. This paper addresses the challenge of automatically identifying the (near)-optimal temperature for different LLMs using multi-sample aggregation strategies, without relying on task-specific validation data. We provide a comprehensive analysis of temperature's role in performance optimization, considering variations in model architectures, datasets, task types, model sizes, and predictive accuracy. Furthermore, we propose a novel entropy-based metric for automated temperature optimization, which consistently outperforms fixed-temperature baselines. Additionally, we incorporate a stochastic process model to enhance interpretability, offering deeper insights into the relationship between temperature and model performance.
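The two aggregation strategies named in the abstract can be sketched in a few lines. This is an illustrative sketch only: the helper names, the tie-breaking rule, and the `score` callable (standing in for a reward model or verifier) are assumptions, not details taken from the paper.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most frequent final answer among the samples.

    Ties go to the answer that appeared first (an assumption of this sketch).
    """
    return Counter(answers).most_common(1)[0][0]

def best_of_n(candidates, score):
    """Return the candidate with the highest score.

    `score` is a hypothetical callable, e.g. a reward model or verifier.
    """
    return max(candidates, key=score)

# Hypothetical sampled answers to one math problem.
sampled_answers = ["42", "41", "42", "7"]
voted = majority_vote(sampled_answers)
```

Both strategies benefit from diverse samples as long as individual samples remain reasonable, which is why the choice of temperature matters for them.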

Paper Structure

This paper contains 40 sections, 13 equations, 13 figures, 4 tables, 1 algorithm.

Figures (13)

  • Figure 1: (a) The entropy turning point (EntP, green star) is defined as the temperature at which the log of the token-level entropy of generation (red line) shifts from concave to convex, which corresponds to the sudden spike in the entropy curve (blue line). (b) The accuracy tested at EntP is highly correlated with the best accuracy from grid search over temperatures on the MATH dataset.
  • Figure 2: (a) Accuracy Heatmap. Performance of Mistral-7B-Instruct-v0.3 under majority voting across different temperatures. The best temperature for each sampling size is highlighted in bold white, and the optimal temperature range is shaded white. The green line shows the temperature predicted by our method. (b) Midpoint of Optimal Temperature Range vs. Number of Samples. The optimal temperature range varies by model; those with training data more closely matching the task tend to favor higher temperatures.
  • Figure 3: Plot of midpoints of optimal temperature ranges (x-axis, sample size 128) vs. distances between models and tasks (y-axis). A strong negative correlation is observed on the MATH and MBPP datasets, with correlation coefficients of -0.895 and -0.777.
  • Figure 4: Entropy Curve Characteristics. (a) The token-level entropy $\mathcal{H}$ (solid blue line) increases slowly at lower temperatures and then jumps sharply at a critical turning point. In contrast, the entropy for a fixed (greedy) generation stays low (dotted blue line). $\log(\mathcal{H})$ (red line) reveals a transition from concavity to convexity that aligns with the sharp increase in $\mathcal{H}$, marking the entropy turning point (EntP). (b) EntP aligns with the best temperature and varies across different models.
  • Figure 5: Stochastic Process Model. We run our process model with the setting $N_0=10$, $N_1=30000$, $L_0=0$, $\sigma_0=1$, $L_1=-10$, and $\alpha=2$. (a) The entropy curve is similar to that of a real language model: flat at first, then sharply increasing. (b) The percentage of improper tokens in the simulation, plotted against temperature, increases quickly after EntP.
  • ...and 8 more figures
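
The captions of Figures 1 and 4 define EntP as the temperature where $\log(\mathcal{H})$ switches from concave to convex. One way this could be located on a measured entropy curve is with a discrete second difference, sketched below. This is an assumption-laden illustration, not the paper's algorithm: it assumes a uniformly spaced temperature grid, a noise-free curve, and a hypothetical function name.

```python
import math

def entropy_turning_point(temperatures, entropies):
    """Return the first grid temperature where log-entropy turns convex.

    Assumes `temperatures` is uniformly spaced and `entropies` are positive
    mean token-level entropies measured at those temperatures.
    Returns None if no concave-to-convex switch is found.
    """
    log_h = [math.log(h) for h in entropies]
    # Discrete second difference; entry j belongs to grid index j + 1.
    second = [
        log_h[i - 1] - 2.0 * log_h[i] + log_h[i + 1]
        for i in range(1, len(log_h) - 1)
    ]
    for k in range(1, len(second)):
        if second[k - 1] < 0.0 and second[k] >= 0.0:
            return temperatures[k + 1]  # first grid point in the convex region
    return None

# Synthetic curve (illustration only): log-entropy is a cubic whose
# inflection lies at T = 0.75, between the grid points 0.7 and 0.8.
grid = [round(0.1 * i, 1) for i in range(1, 14)]
synthetic_entropy = [math.exp((t - 0.75) ** 3) for t in grid]
turning_point = entropy_turning_point(grid, synthetic_entropy)
```

On real measurements the entropy curve is noisy, so some smoothing or a finer grid would likely be needed before taking second differences.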