Monte Carlo Temperature: a robust sampling strategy for LLM's uncertainty quantification methods
Nicola Cecere, Andrea Bacciu, Ignacio Fernández Tobías, Amin Mantrach
TL;DR
Uncertainty quantification for LLMs is highly sensitive to the temperature parameter $\tau$ used in multi-sample decoding, which can undermine reliability and drive costly hyperparameter optimization. The paper introduces Monte Carlo Temperature (MCT), a robust sampling strategy that draws $k$ temperatures from a distribution over $[\tau_{min},\tau_{max}]$, generates $k$ responses, and applies existing multi-sample UQ methods without requiring temperature calibration. Across multiple open-source LLMs and standard QA datasets, MCT achieves statistical parity with oracle temperatures and outperforms fixed best-average and random baselines in AUROC, PR-AUC, and AURAC, demonstrating robust uncertainty estimates with reduced computational burden. This approach offers a practical, scalable solution to reliable UQ in real-world LLM deployments, enabling safer and more dependable AI systems without extensive hyperparameter tuning.
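To make the procedure concrete, here is a minimal Python sketch of the sampling loop described above. It is an illustration under stated assumptions, not the authors' implementation. The temperature distribution is assumed to be uniform over $[\tau_{min},\tau_{max}]$, since the summary only says "a distribution". `generate` is a hypothetical callable standing in for whatever LLM interface is in use. `answer_entropy` is a simple stand-in for an existing multi-sample UQ method, not one named in the paper. The default range of 0.1 to 1.0 is purely illustrative.

```python
import math
import random
from collections import Counter
from typing import Callable, List, Optional


def sample_temperatures(k: int, tau_min: float, tau_max: float,
                        rng: Optional[random.Random] = None) -> List[float]:
    """Draw k temperatures from [tau_min, tau_max].

    Assumption: a uniform distribution; the source only says "a distribution".
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if not 0.0 <= tau_min <= tau_max:
        raise ValueError("need 0 <= tau_min <= tau_max")
    rng = rng or random.Random()
    return [rng.uniform(tau_min, tau_max) for _ in range(k)]


def answer_entropy(responses: List[str]) -> float:
    """Stand-in UQ score: Shannon entropy (in nats) over exact-match answers.

    Responses are normalized by stripping whitespace and lowercasing.
    Higher values mean the sampled answers disagree more.
    """
    counts = Counter(r.strip().lower() for r in responses)
    n = len(responses)
    return sum((c / n) * math.log(n / c) for c in counts.values())


def mct_uncertainty(prompt: str,
                    generate: Callable[[str, float], str],
                    k: int = 5,
                    tau_min: float = 0.1,   # illustrative default, not from the paper
                    tau_max: float = 1.0,   # illustrative default, not from the paper
                    rng: Optional[random.Random] = None) -> float:
    """Generate one response per sampled temperature, then score the set."""
    temperatures = sample_temperatures(k, tau_min, tau_max, rng)
    # `generate` is a hypothetical wrapper around an LLM call.
    responses = [generate(prompt, tau) for tau in temperatures]
    return answer_entropy(responses)
```

In this sketch the only change relative to fixed-temperature sampling is that each of the $k$ generations receives its own temperature; the downstream UQ score is computed exactly as it would be otherwise.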
Abstract
Uncertainty quantification (UQ) in Large Language Models (LLMs) is essential for their safe and reliable deployment, particularly in critical applications where incorrect outputs can have serious consequences. Current UQ methods typically rely on querying the model multiple times using non-zero temperature sampling to generate diverse outputs for uncertainty estimation. However, the impact of the chosen temperature is understudied, and we find that temperature plays a fundamental role in the quality of uncertainty estimates. The conventional approach of identifying optimal temperature values requires expensive hyperparameter optimization (HPO) that must be repeated for each new model-dataset combination. We propose Monte Carlo Temperature (MCT), a robust sampling strategy that eliminates the need for temperature calibration. Our analysis reveals that: 1) MCT provides more robust uncertainty estimates across a wide range of temperatures, 2) MCT improves the performance of UQ methods when it replaces fixed-temperature strategies that do not rely on HPO, and 3) MCT achieves statistical parity with oracle temperatures, which represent the ideal outcome of a well-tuned but computationally expensive HPO process. These findings demonstrate that effective UQ can be achieved without the computational burden of temperature parameter calibration.
