No Answer Needed: Predicting LLM Answer Accuracy from Question-Only Linear Probes

Iván Vicente Moreno Cencerrado; Arnau Padrés Masdemont; Anton Gonzalvez Hawthorne; David Demitri Africa; Lorenzo Pacchiardi

TL;DR

This work investigates whether a large language model's hidden activations contain a linear signal that predicts whether its forthcoming answer will be correct, using activations taken immediately after the question is processed but before any token is generated. A simple difference-of-means linear probe identifies a latent "in-advance correctness direction" along which questions the model goes on to answer correctly and questions it goes on to answer incorrectly are linearly separable; performance is measured by AUROC across multiple models and datasets. The direction generalizes to several factual knowledge datasets but not to arithmetic reasoning, its strength increases with model size, and the signal emerges in intermediate layers. The findings offer a low-cost internal signal that could inform safety measures such as early stopping or fallback, and they advance understanding of how LLMs internally assess their own capabilities, with the caveats that the direction fails on arithmetic tasks and may carry dataset-specific biases.

Abstract

Do large language models (LLMs) anticipate when they will answer correctly? To study this, we extract activations after a question is read but before any tokens are generated, and train linear probes to predict whether the model's forthcoming answer will be correct. Across three open-source model families ranging from 7 to 70 billion parameters, projections on this "in-advance correctness direction" trained on generic trivia questions predict success in distribution and on diverse out-of-distribution knowledge datasets, indicating a deeper signal than dataset-specific spurious features, and outperforming black-box baselines and verbalised predicted confidence. Predictive power saturates in intermediate layers and, notably, generalisation falters on questions requiring mathematical reasoning. Moreover, for models responding "I don't know", doing so strongly correlates with the probe score, indicating that the same direction also captures confidence. By complementing previous results on truthfulness and other behaviours obtained with probes and sparse auto-encoders, our work contributes essential findings to elucidate LLM internals.
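
The difference-of-means probe and its AUROC evaluation can be illustrated with a minimal sketch. This is not the authors' code: the activations here are synthetic Gaussian vectors with a planted direction, and the function names (fit_direction, probe_scores, auroc) are hypothetical. In a real setting, each row of the activation matrix would presumably be a hidden-state vector read at a chosen layer after the question has been processed, and each label would record whether the model's generated answer was judged correct.

```python
import numpy as np


def fit_direction(acts, correct):
    """Unit vector from the mean 'incorrect' activation to the mean 'correct' one."""
    acts = np.asarray(acts, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    direction = acts[correct].mean(axis=0) - acts[~correct].mean(axis=0)
    return direction / np.linalg.norm(direction)


def probe_scores(acts, direction):
    """Project each activation vector onto the probe direction."""
    return np.asarray(acts, dtype=float) @ direction


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))


# Synthetic stand-in for question-only activations (an assumption, not real model data).
rng = np.random.default_rng(0)
dim, n = 64, 400
planted = rng.normal(size=dim)
planted /= np.linalg.norm(planted)
labels = rng.random(n) < 0.5
shift = np.where(labels, 1.5, -1.5)          # correct vs. incorrect offset along `planted`
acts = rng.normal(size=(n, dim)) + np.outer(shift, planted)

# Fit the direction on one split and evaluate on held-out examples.
train, test = slice(0, 300), slice(300, n)
direction = fit_direction(acts[train], labels[train])
test_auc = auroc(probe_scores(acts[test], direction), labels[test])
print(f"held-out AUROC: {test_auc:.3f}")
```

Because the probe is a single vector and scoring is one dot product per question, the score is available before any answer token is generated, which is what makes the signal cheap enough for uses such as early stopping or fallback.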
