Lightweight Latent Verifiers for Efficient Meta-Generation Strategies

Bartosz Piotrowski, Witold Drzewakowski, Konrad Staniszewski, Piotr Miłoś

TL;DR

This work tackles the high cost of verifiers in reasoning-focused LLM tasks by introducing LiLaVe, a lightweight latent verifier that extracts correctness signals from the base model's hidden states. A small XGBoost-based classifier is trained on fixed hidden-state locations to predict final answer correctness, and per-state scores are averaged to yield a LiLaVe score used to drive meta-generation strategies such as conditional majority voting and conditional self-correction. Across GSM8K, GSM-Symbolic, algebra_linear_1d, and MATH benchmarks, LiLaVe achieves strong discriminatory power (AUC) with substantially fewer training examples than LLM-based verifiers, enabling significant inference-time efficiency gains. The study also demonstrates transferability across datasets and base models, and proposes practical strategies to balance accuracy and compute, marking a step toward scalable, resource-efficient reasoning in real-world settings.
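
As a rough illustration of the scoring step described above, the sketch below averages per-state correctness scores into a single LiLaVe-style score. The names `lilave_score`, `scorer`, and `toy_scorer`, as well as the dict layout keyed by (layer, token) pairs, are assumptions made for this example. The paper's classifier is XGBoost-based; a plain callable stands in for it here so the snippet stays self-contained.

```python
from statistics import fmean
from typing import Callable, Mapping, Sequence, Tuple

# A location is a (layer index, token index) pair. This alias and the
# dict-based layout of hidden states are illustrative assumptions.
Location = Tuple[int, int]
Vector = Sequence[float]


def lilave_score(
    hidden_states: Mapping[Location, Vector],
    locations: Sequence[Location],
    scorer: Callable[[Vector], float],
) -> float:
    """Average the per-state correctness scores over the chosen locations."""
    return fmean(scorer(hidden_states[loc]) for loc in locations)


def toy_scorer(vec: Vector) -> float:
    """Toy stand-in for a trained classifier: maps a vector into [0, 1]."""
    return min(1.0, max(0.0, sum(vec) / len(vec)))


# Two hypothetical hidden-state vectors taken from layer 16, last two tokens.
states = {(16, -1): [0.5, 1.0], (16, -2): [0.25, 0.25]}
score = lilave_score(states, [(16, -1), (16, -2)], toy_scorer)
```

In a real setting, `scorer` would wrap the trained classifier's predicted probability of the "correct" class for one hidden-state vector.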

Abstract

Verifiers are auxiliary models that assess the correctness of outputs generated by base large language models (LLMs). They play a crucial role in many strategies for solving reasoning-intensive problems with LLMs. Typically, verifiers are LLMs themselves, often as large as (or larger than) the base model they support, making them computationally expensive. In this work, we introduce a novel lightweight verification approach, LiLaVe, which reliably extracts correctness signals from the hidden states of the base LLM. A key advantage of LiLaVe is its ability to operate with only a small fraction of the computational budget required by traditional LLM-based verifiers. To demonstrate its practicality, we couple LiLaVe with popular meta-generation strategies, such as best-of-$n$ and self-consistency. Moreover, we design novel LiLaVe-based approaches, such as conditional self-correction and conditional majority voting, that significantly improve both accuracy and efficiency in generation tasks with smaller LLMs. Our work demonstrates the fruitfulness of extracting latent information from the hidden states of LLMs, and opens the door to scalable and resource-efficient solutions for reasoning-intensive applications.
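
The verifier-guided strategies named above can be sketched as follows. This is a minimal illustration, assuming each sample is an (answer, score) pair whose score comes from a verifier such as LiLaVe. The function names are hypothetical. Weighted majority voting is taken here to mean summing verifier scores per distinct answer, which is a common definition and may differ in detail from the paper's.

```python
from collections import Counter, defaultdict
from typing import List, Tuple

# (final answer, verifier score); this pair layout is an assumption.
Sample = Tuple[str, float]


def best_of_n(samples: List[Sample]) -> str:
    """Return the answer of the single highest-scoring sample."""
    return max(samples, key=lambda s: s[1])[0]


def majority_vote(samples: List[Sample]) -> str:
    """Self-consistency: return the most frequent answer, ignoring scores."""
    return Counter(answer for answer, _ in samples).most_common(1)[0][0]


def weighted_majority_vote(samples: List[Sample]) -> str:
    """Return the answer whose samples have the largest total score."""
    totals = defaultdict(float)
    for answer, score in samples:
        totals[answer] += score
    return max(totals, key=totals.get)


# Toy data: "15" is more frequent, but "12" has one high-confidence sample.
samples = [("12", 0.9), ("15", 0.3), ("15", 0.35), ("12", 0.2), ("15", 0.3)]
```

On this toy input, plain majority voting picks "15" (three votes against two), while best-of-n and weighted majority voting both pick "12", because its total score of 1.1 exceeds the 0.95 accumulated by "15".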

Paper Structure

This paper contains 50 sections, 13 equations, 18 figures, 4 tables.

Figures (18)

  • Figure 1: Predictive performance of LiLaVe at individual hidden-state locations, each determined by a transformer layer index and a token index in the sequence. We test tokens from the prefix and the suffix of the generated sequences, each of length 32. A higher-quality signal can be retrieved from the final tokens; interestingly, however, even for the first tokens LiLaVe provides a signal significantly better than the random baseline (dashed lines). At the same time, we cannot conclude which transformer layers give the best signal.
  • Figure 2: Performance (AUC) of LiLaVe trained and evaluated on hidden states of Llama 3.1 8B answers to GSM8K questions with various temperature settings.
  • Figure 3: Best-of-$n$, majority voting, and weighted majority voting on four datasets. For each method, the number of samples per question is varied between 1 and 16. Weighted majority voting performs best on all four datasets, but the margin varies from dataset to dataset.
  • Figure 4: Conditional self-correction on four datasets. The dark points indicate the performance for different score thresholds. The left-most points correspond to no self-correction (only one sample generated per question); the right-most points correspond to unconditional self-correction (two samples generated per question). The optimal thresholds are marked in orange.
  • Figure 5: Conditional majority voting with varying threshold $s$ and number of samples per question $n$ between 1 and 256. The parameter $n$ is shown only implicitly: for a fixed $s$, it determines the total number of generated samples together with the number of dataset questions scored below $s$ (for a dataset $\mathcal{D}$, this total is at most $(n + 1) \cdot |\mathcal{D}|$; the one additional sample per question is the probe sample). Note that the $x$-axis uses a logarithmic scale. The baseline of standard majority voting is shown in black. Conditional majority voting outperforms the baseline over a wide range of generation budgets.
  • ...and 13 more figures
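
The sample-budget accounting in the Figure 5 caption can be made concrete with a small sketch of conditional majority voting. It assumes the procedure is: generate one probe sample per question, accept it if its verifier score is at least $s$, and otherwise generate $n$ further samples and take their majority answer. Whether the probe itself takes part in the vote is not stated in the caption, so excluding it is an assumption, as are all function and variable names below.

```python
from collections import Counter
from typing import Callable, List, Tuple


def conditional_majority_vote(
    questions: List[str],
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    s: float,
    n: int,
) -> Tuple[List[str], int]:
    """Return the chosen answers and the total number of generated samples."""
    answers, total = [], 0
    for q in questions:
        probe = generate(q)  # one probe sample per question
        total += 1
        if score(q, probe) >= s:  # confident enough: keep the probe answer
            answers.append(probe)
            continue
        extra = [generate(q) for _ in range(n)]  # fall back to n more samples
        total += n
        answers.append(Counter(extra).most_common(1)[0][0])
    return answers, total


# Toy stand-ins: a scripted generator and a scorer that trusts only "4".
script = {"q1": iter(["4"]), "q2": iter(["5", "6", "6", "7"])}


def toy_generate(q: str) -> str:
    return next(script[q])


def toy_score(q: str, a: str) -> float:
    return 0.9 if a == "4" else 0.1


answers, total = conditional_majority_vote(
    ["q1", "q2"], toy_generate, toy_score, s=0.5, n=3
)
```

Here only the second question falls below the threshold, so the run generates 1 + (1 + 3) = 5 samples, below the upper bound of $(n + 1) \cdot |\mathcal{D}| = 8$ that is reached when every probe is rejected.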