Can Linear Probes Measure LLM Uncertainty?
Ramzi Dakhmouche, Adrien Letellier, Hossein Gorji
TL;DR
The paper tackles reliable uncertainty quantification (UQ) for large language models in discrete-choice tasks, where the standard maximum softmax probability (MSP) score is insufficient. It introduces Bayesian Linear Lens (BLL), a lightweight framework that learns layer-wise Bayesian linear models to approximate activations conditioned on truthfulness, then combines the resulting layer-level posteriors through sparse regression to obtain a global uncertainty score. The authors compare two feature designs (posterior log-likelihoods and log-likelihood ratios) and report consistent AUROC gains over MSP baselines across multiple LLMs on the MMLU dataset, with additional gains for open-ended questions on TriviaQA. The work highlights that a principled Bayesian analysis of simple linear probes can achieve strong uncertainty quantification without heavy ensembles, supporting scalable, interpretable UQ for safe AI deployment.
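For orientation, the MSP baseline and the AUROC metric mentioned above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code: it assumes `logits` holds one row of per-option scores per question and `correct` marks whether the model's answer was right. The function names are hypothetical.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability over the answer options, one score per question."""
    z = logits - logits.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return p.max(axis=1)

def auroc(scores, correct):
    """Probability that a correctly answered question receives a higher
    confidence score than an incorrectly answered one (ties count as 1/2)."""
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = scores[correct], scores[~correct]
    diff = pos[:, None] - neg[None, :]  # all (correct, incorrect) score pairs
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (pos.size * neg.size)
```

An AUROC of 0.5 means the score does not separate correct from incorrect answers at all. A score of 1.0 means it ranks every correct answer above every incorrect one.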
Abstract
Effective Uncertainty Quantification (UQ) is key to the reliable deployment of Large Language Models (LLMs) in automated decision-making and beyond. Yet, for LLM generation with a multiple-choice structure, the state of the art in UQ is still dominated by a naive baseline, the maximum softmax score. To address this shortcoming, we demonstrate that a principled approach based on Bayesian statistics improves performance even with the simplest possible model, namely linear regression. More precisely, we propose to train multiple Bayesian linear models, each predicting the output of a layer given the output of the previous one. From the resulting layer-level posterior distributions, we infer the global uncertainty level of the LLM by identifying a sparse combination of distributional features, which yields an efficient UQ scheme. Numerical experiments on various LLMs show consistent improvement over state-of-the-art baselines.
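To make the layer-level step concrete, the following minimal sketch fits a conjugate Bayesian linear regression from one layer's activations `X` to the next layer's activations `Y`, and evaluates the posterior predictive log-likelihood of a new activation pair. It is an illustration under stated assumptions rather than the authors' implementation: the isotropic Gaussian prior, the known precisions `alpha` and `beta`, and the function names are all hypothetical choices not given in the abstract.

```python
import numpy as np

def fit_bayes_linear(X, Y, alpha=1.0, beta=1.0):
    """Posterior over weights W for Y ~ X W, assuming a prior W_ij ~ N(0, 1/alpha)
    and Gaussian noise with known precision beta (both are assumptions)."""
    d = X.shape[1]
    # Posterior covariance, shared by every output dimension.
    S = np.linalg.inv(alpha * np.eye(d) + beta * X.T @ X)
    # Posterior mean, shape (d_in, d_out).
    M = beta * S @ X.T @ Y
    return M, S

def predictive_loglik(x, y, M, S, beta=1.0):
    """Log-density of y under the posterior predictive N(x M, (1/beta + x S x^T) I)."""
    mean = x @ M
    var = 1.0 / beta + x @ S @ x  # noise variance plus weight uncertainty
    k = y.shape[0]
    return -0.5 * (k * np.log(2 * np.pi * var) + np.sum((y - mean) ** 2) / var)
```

One such log-likelihood per layer gives a vector of distributional features. A sparse regression (for example, an L1-penalised one) over those features would then produce the global score, although the exact features and regressor used in the paper are not specified in the abstract.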
