No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

Iván Vicente Moreno Cencerrado, Arnau Padrés Masdemont, Anton Gonzalvez Hawthorne, David Demitri Africa, Lorenzo Pacchiardi

TL;DR

This work investigates whether a large language model's hidden activations contain a linear signal that predicts the correctness of its forthcoming answer, using activations taken immediately after the question is processed but before generation begins. A simple difference-of-means linear probe identifies a latent "in-advance correctness direction" along which questions the model goes on to answer correctly and those it goes on to answer incorrectly are linearly separable; performance is evaluated via AUROC across multiple models and datasets. The direction generalizes to several factual knowledge datasets but not to arithmetic reasoning. Its strength increases with model size, and the signal emerges in intermediate layers. The findings offer a low-cost internal signal that could inform safety measures, such as early stopping or fallback, and advance understanding of how LLMs internally assess their own capabilities, albeit with limitations on arithmetic tasks and potential dataset-specific biases.

Abstract

Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model's forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections on this "in-advance correctness direction" trained on generic trivia questions predict success in distribution and on diverse out-of-distribution knowledge datasets, indicating a deeper signal than dataset-specific spurious features, and outperforming black-box baselines and verbalised predicted confidence. Predictive power saturates in intermediate layers and, notably, generalisation falters on questions requiring mathematical reasoning. Moreover, for models responding "I don't know", doing so strongly correlates with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes essential findings to elucidate LLM internals.
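
As a rough illustration of the difference-of-means probe and AUROC evaluation named in the TL;DR, the following minimal sketch fits a direction on synthetic vectors and scores held-out examples by their projection onto it. It is not the authors' implementation: the synthetic data generator, the function names, the dimensionality, and the shift size are all assumptions made for illustration, and real use would replace the synthetic vectors with residual-stream activations taken at the last question token.

```python
import random


def mean_vector(rows):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def fit_direction(activations, labels):
    """Difference of means: mean(correct) minus mean(incorrect)."""
    pos = [a for a, y in zip(activations, labels) if y == 1]
    neg = [a for a, y in zip(activations, labels) if y == 0]
    mu_pos, mu_neg = mean_vector(pos), mean_vector(neg)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def project(activations, direction):
    """Dot product of each activation vector with the direction."""
    return [sum(a_i * d_i for a_i, d_i in zip(a, direction)) for a in activations]


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as 0.5. This pairwise form is O(n_pos * n_neg), which is
    fine for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_synthetic(n, dim, shift, rng):
    """Hypothetical stand-in for real activations (assumption).

    Gaussian noise in every dimension; examples labelled 1 ("correct")
    are shifted along axis 0.
    """
    data, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        x[0] += shift * y
        data.append(x)
        labels.append(y)
    return data, labels


rng = random.Random(0)
train_x, train_y = make_synthetic(400, 16, 2.0, rng)
test_x, test_y = make_synthetic(400, 16, 2.0, rng)

direction = fit_direction(train_x, train_y)
test_auroc = auroc(project(test_x, direction), test_y)
print(f"held-out AUROC: {test_auroc:.3f}")
```

Because AUROC depends only on the ranking of the projections, this sketch needs no threshold or bias term; whether the paper centres or normalises activations before projecting is not stated here, so the sketch leaves them raw.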

Paper Structure

This paper contains 35 sections, 2 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Proposed methodology to find the in-advance correctness direction. (A) Residual stream activations for all model layers are extracted at the last token of the question, prior to sampling. (B) Model answers are generated and evaluated against the ground truth. (C) The direction that best discriminates between activations associated with correct and incorrect answers is identified (the first two principal components at a specific layer are visualised). (D) The most discriminative layer is chosen. (E) The final correctness classifier is trained on the identified layer, and its out-of-distribution performance is assessed.
  • Figure 2: TriviaQA AUROC (average over 3 folds) across layers. We collect activations every 2 layers for small (<10B parameters) models and every 4 layers for large (>10B parameters) models.
  • Figure 3: AUROC scores on each dataset for the direction learned on each dataset individually, for two selected models (others in Appendix \ref{sec:heatmaps}). Average AUROC over 5 folds is reported (Section \ref{sec:exp_gen}).
  • Figure 4: Distribution of the projections of activations onto the correctness direction from TriviaQA, grouped by produced answer (right, wrong, "I don't know"), for a selection of models and datasets.
  • Figure 5: AUROC scores for each model and test dataset for different numbers of training samples from TriviaQA, for our correctness direction approach. To reduce variance, 10 experiments were performed for each number of training samples and the average AUROC is reported. Note that the x-axis is logarithmic.
  • ...and 4 more figures