Monte Carlo Temperature: a robust sampling strategy for LLM's uncertainty quantification methods

Nicola Cecere, Andrea Bacciu, Ignacio Fernández Tobías, Amin Mantrach

TL;DR

Uncertainty quantification for LLMs is highly sensitive to the temperature parameter $\tau$ used in multi-sample decoding, which can undermine reliability and drive costly hyperparameter optimization. The paper introduces Monte Carlo Temperature (MCT), a robust sampling strategy that draws $k$ temperatures from a distribution over $[\tau_{min},\tau_{max}]$, generates $k$ responses, and applies existing multi-sample UQ methods without requiring temperature calibration. Across multiple open-source LLMs and standard QA datasets, MCT achieves statistical parity with oracle temperatures and outperforms fixed best-average and random baselines in AUROC, PR-AUC, and AURAC, demonstrating robust uncertainty estimates with reduced computational burden. This approach offers a practical, scalable solution to reliable UQ in real-world LLM deployments, enabling safer and more dependable AI systems without extensive hyperparameter tuning.
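
The sampling loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. The summary only says temperatures are drawn from "a distribution" over $[\tau_{min},\tau_{max}]$, so the uniform draw here is an assumption. The names `sample_temperatures`, `mct_responses`, `answer_entropy`, and `make_toy_generator` are hypothetical. The `generate` callable stands in for whatever LLM call is available. The exact-match entropy is a simplified stand-in for a real multi-sample UQ method such as semantic entropy.

```python
import math
import random
from collections import Counter
from typing import Callable, List


def sample_temperatures(k: int, tau_min: float, tau_max: float,
                        rng: random.Random) -> List[float]:
    """Draw k temperatures from [tau_min, tau_max].

    Assumption: a uniform distribution; the source does not fix the choice.
    """
    if k <= 0 or tau_min > tau_max:
        raise ValueError("need k > 0 and tau_min <= tau_max")
    return [rng.uniform(tau_min, tau_max) for _ in range(k)]


def mct_responses(prompt: str,
                  generate: Callable[[str, float], str],
                  k: int, tau_min: float, tau_max: float,
                  rng: random.Random) -> List[str]:
    """Generate k responses, each decoded at its own sampled temperature."""
    temperatures = sample_temperatures(k, tau_min, tau_max, rng)
    return [generate(prompt, tau) for tau in temperatures]


def answer_entropy(responses: List[str]) -> float:
    """Entropy over exact-match answer clusters (after strip + lowercase).

    Assumption: a crude stand-in for a multi-sample UQ score; higher values
    mean the sampled answers disagree more.
    """
    counts = Counter(r.strip().lower() for r in responses)
    n = len(responses)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def make_toy_generator(rng: random.Random) -> Callable[[str, float], str]:
    """Hypothetical fake model: deviates from its modal answer more often
    as the temperature rises. Replace with a real LLM call in practice."""
    def generate(prompt: str, temperature: float) -> str:
        return "Paris" if rng.random() > 0.3 * temperature else "Lyon"
    return generate


if __name__ == "__main__":
    generate = make_toy_generator(random.Random(1))
    responses = mct_responses("Capital of France?", generate,
                              k=10, tau_min=0.1, tau_max=1.0,
                              rng=random.Random(0))
    print(responses, answer_entropy(responses))
```

The point of the design is that no single temperature has to be chosen or tuned: the only inputs are the interval bounds and $k$, and any downstream UQ method that consumes a list of sampled responses can be swapped in for `answer_entropy` unchanged.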

Abstract

Uncertainty quantification (UQ) in Large Language Models (LLMs) is essential for their safe and reliable deployment, particularly in critical applications where incorrect outputs can have serious consequences. Current UQ methods typically rely on querying the model multiple times using non-zero temperature sampling to generate diverse outputs for uncertainty estimation. However, the impact of selecting a given temperature parameter is understudied, and our analysis reveals that temperature plays a fundamental role in the quality of uncertainty estimates. The conventional approach of identifying optimal temperature values requires expensive hyperparameter optimization (HPO) that must be repeated for each new model-dataset combination. We propose Monte Carlo Temperature (MCT), a robust sampling strategy that eliminates the need for temperature calibration. Our analysis reveals that: 1) MCT provides more robust uncertainty estimates across a wide range of temperatures, 2) MCT improves the performance of UQ methods by replacing fixed-temperature strategies that do not rely on HPO, and 3) MCT achieves statistical parity with oracle temperatures, which represent the ideal outcome of a well-tuned but computationally expensive HPO process. These findings demonstrate that effective UQ can be achieved without the computational burden of temperature parameter calibration.

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: AUROC score distributions of the semantic entropy method across various model-dataset combinations and different fixed temperature values.
  • Figure 2: Comparison between oracle-fixed temperature performance and MCT sampling strategy performance across different UQ methods using the AUROC metric.
  • Figure 3: AUROC score distributions of tested UQ methods across various model-dataset combinations at different fixed temperature values.
  • Figure 4: PR-AUC score distributions of tested UQ methods across various model-dataset combinations at different fixed temperature values.
  • Figure 5: AURAC score distributions of tested UQ methods across various model-dataset combinations at different fixed temperature values.
  • ...and 2 more figures