Do LLMs Signal When They're Right? Evidence from Neuron Agreement

Kang Chen, Yaoning Wang, Kai Xiong, Zhuoka Feng, Wenhe Sun, Haotian Chen, Yixin Cao

TL;DR

This work tackles the problem of calibrating and enriching ensemble decoding for LLMs without relying on ground-truth labels or external outputs. It introduces Neuron-Agreement Decoding (NAD), an unsupervised method that selects high-quality trajectories based on internal neuron activations, enabling correctness prediction within the first $32$ tokens and supporting early stopping. Empirical results show NAD matches or surpasses majority voting on verifiable math/science tasks and outperforms Avg@64 on open-ended code, while dramatically reducing token usage by up to $99\%$. The findings demonstrate that internal activation patterns provide reliable, scalable guidance for label-free ensemble decoding, with practical implications for efficient inference in diverse reasoning tasks.

Abstract

Large language models (LLMs) commonly boost reasoning via sample-evaluate-ensemble decoders, achieving label-free gains without ground truth. However, prevailing strategies score candidates using only external outputs such as token probabilities, entropies, or self-evaluations, and these signals can be poorly calibrated after post-training. We instead analyze internal behavior based on neuron activations and uncover three findings: (1) external signals are low-dimensional projections of richer internal dynamics; (2) correct responses activate substantially fewer unique neurons than incorrect ones throughout generation; and (3) activations from correct responses exhibit stronger cross-sample agreement, whereas incorrect ones diverge. Motivated by these observations, we propose Neuron Agreement Decoding (NAD), an unsupervised best-of-N method that selects candidates using activation sparsity and cross-sample neuron agreement, operating solely on internal signals and without requiring comparable textual outputs. NAD enables early correctness prediction within the first 32 generated tokens and supports aggressive early stopping. Across math and science benchmarks with verifiable answers, NAD matches majority voting; on open-ended coding benchmarks where majority voting is inapplicable, NAD consistently outperforms Avg@64. By pruning unpromising trajectories early, NAD reduces token usage by 99% with minimal loss in generation quality, showing that internal signals provide reliable, scalable, and efficient guidance for label-free ensemble decoding.
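
To make the two selection signals named in the abstract concrete, here is a minimal Python sketch, not the authors' implementation. It assumes each sampled response has already been reduced to a set of activated-neuron ids; how those sets are extracted from the model is not shown. The function names, the use of Jaccard similarity as the agreement measure, and the toy data are all illustrative assumptions.

```python
from typing import List, Set


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Jaccard similarity between two sets of activated-neuron ids."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_by_sparsity(activated: List[Set[int]]) -> int:
    """Index of the sample that activates the fewest unique neurons."""
    return min(range(len(activated)), key=lambda i: len(activated[i]))


def select_by_knn_agreement(activated: List[Set[int]], k: int = 2) -> int:
    """Index of the sample with the highest mean similarity to its k nearest neighbours."""
    n = len(activated)
    k = min(k, n - 1)
    best_idx, best_score = 0, float("-inf")
    for i in range(n):
        # Similarities to every other sample, most similar first.
        sims = sorted(
            (jaccard(activated[i], activated[j]) for j in range(n) if j != i),
            reverse=True,
        )
        score = sum(sims[:k]) / k if k > 0 else 0.0
        if score > best_score:
            best_idx, best_score = i, score
    return best_idx


# Hypothetical toy data: samples 0-2 share most neurons, sample 3 is a diffuse outlier.
samples = [
    {1, 2, 3, 4},
    {1, 2, 3, 5},
    {1, 2, 3},
    {7, 8, 9, 10, 11, 12, 13},
]
print(select_by_sparsity(samples))
print(select_by_knn_agreement(samples))
```

On this toy input both criteria pick the same sample (index 2), which is both the sparsest and the closest to its neighbours. On real activations the two signals need not coincide.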

Paper Structure

This paper contains 19 sections, 9 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Comparison between NAD and other ensemble methods. Our approach relies solely on internal signals during the sampling process, without requiring comparable textual outputs.
  • Figure 2: Scatter plots of the number of activated neurons versus confidence-based metrics (Self-Certainty and Entropy). The neuron counts show significant correlations with both metrics, indicating that the activated neuron states provide a high-dimensional representation of traditional confidence measures.
  • Figure 3: t-SNE representation of activated neurons, with point colors indicating the average entropy of the corresponding samples. No clear consistency is observed between entropy and the resulting clusters, suggesting that the activated neurons contain high-dimensional structural information not captured by entropy.
  • Figure 4: Preliminary AIME24 results. (a) t-SNE of responses to one prompt: center clusters share similar reasoning; outliers diverge. (b) Correct answers activate far fewer neurons than incorrect ones. (c) Token-wise trajectories show incorrect responses repeatedly shift strategies, engaging more neurons. These observations motivate Insight 1 and Insight 2 presented in the paper's preliminary experiments section.
  • Figure 5: Framework of Neuron Agreement Decoding (NAD). NAD selects high-quality answers by leveraging the consensus of internal neuron activations during the sampling process, without relying on canonical textual outputs. This consistency can be identified using the proposed kNN-based approach, among others. Moreover, this procedure can be applied at an early stage of sequence generation, pruning low-quality responses in advance and reducing token usage.
  • ...and 5 more figures
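
The early-pruning idea mentioned in the Figure 5 caption can be illustrated with a small Python sketch. This is a hedged illustration, not the paper's procedure. It assumes each trajectory is available as a list of per-token sets of activated-neuron ids, and it ranks prefixes by how few unique neurons they activate (the sparsity observation from Figure 4). The names `prefix_neuron_set` and `prune_early`, the `keep` parameter, and the toy data are hypothetical. The abstract reports a 32-token prefix; the toy example uses 2 tokens for brevity.

```python
from typing import List, Set


def prefix_neuron_set(per_token: List[Set[int]], prefix_len: int) -> Set[int]:
    """Union of neuron ids activated over the first `prefix_len` tokens."""
    out: Set[int] = set()
    for token_set in per_token[:prefix_len]:
        out |= token_set
    return out


def prune_early(
    trajectories: List[List[Set[int]]], prefix_len: int, keep: int
) -> List[int]:
    """Indices of the `keep` trajectories whose prefixes activate the fewest unique neurons."""
    counts = [len(prefix_neuron_set(t, prefix_len)) for t in trajectories]
    # Sort by neuron count, breaking ties by original index for determinism.
    order = sorted(range(len(trajectories)), key=lambda i: (counts[i], i))
    return sorted(order[:keep])


# Hypothetical toy trajectories: three samples, three tokens each.
toy = [
    [{1, 2}, {2, 3}, {50}],
    [{1, 4}, {5, 6}, {7}],
    [{1, 2}, {1, 2}, {9, 10, 11}],
]
survivors = prune_early(toy, prefix_len=2, keep=2)
print(survivors)
```

Only the surviving trajectories would be decoded further; the others stop after the prefix, which is where the token savings come from.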