Improving Preference Extraction In LLMs By Identifying Latent Knowledge Through Classifying Probes

Sharan Maiya, Yinhong Liu, Ramit Debnath, Anna Korhonen

TL;DR

The paper tackles biases and inefficiencies in LLM-based evaluation by introducing linear classifying probes trained on contrast pairs to explicitly access latent judgement. Both supervised and unsupervised probes demonstrate superior alignment with human judgments compared to generation-based scoring, with unsupervised probes offering strong robustness and efficiency, and supervised probes yielding further gains with modest labeled data. Across multiple model families and diverse datasets, probes generalize well under distributional shifts and can even surpass finetuned evaluators under similar data budgets. This approach provides interpretable insights into how models encode judgement, enabling cost-effective, robust LLM-as-a-judge applications with broad practical impact.

Abstract

Large Language Models (LLMs) are often used as automated judges to evaluate text, but their effectiveness can be hindered by various unintentional biases. We propose using linear classifying probes, trained by leveraging differences between contrasting pairs of prompts, to directly access LLMs' latent knowledge and extract more accurate preferences. Through extensive experiments using models of varying size from four different families and six diverse datasets assessing text quality evaluation and common sense reasoning, we demonstrate that both supervised and unsupervised probing approaches consistently outperform traditional generation-based judgement while maintaining similar computational costs. These probes generalise under domain shifts and can even outperform finetuned evaluators with the same training data size. Our results suggest linear probing offers an accurate, robust and computationally efficient approach for LLM-as-judge tasks while providing interpretable insights into how models encode judgement-relevant knowledge. Our data and code will be openly released in the future.
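As a rough illustration of the unsupervised idea described above, the sketch below fits a linear probe as the first principal component of contrast pair differences and classifies by the sign of the projection. It runs on synthetic vectors rather than real LLM embeddings; the function names, dimensions, and noise model are all assumptions made for illustration and are not taken from the paper. Because a principal direction is only defined up to sign, accuracy is reported as sign-invariant.

```python
import numpy as np


def fit_unsupervised_probe(diffs):
    """Return the mean and first principal direction of the difference vectors."""
    mean = diffs.mean(axis=0)
    centered = diffs - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[0]


def probe_predict(diffs, mean, direction):
    """Classify each pair by the sign of its projection onto the probe direction."""
    return ((diffs - mean) @ direction > 0).astype(int)


def sign_invariant_accuracy(pred, labels):
    """A principal direction has arbitrary sign, so score both orientations."""
    acc = float((pred == labels).mean())
    return max(acc, 1.0 - acc)


# Synthetic stand-in for embedding differences (hypothetical data, not from the paper):
# each row plays the role of embedding(prompt variant A) - embedding(prompt variant B).
rng = np.random.default_rng(0)
n_pairs, dim = 400, 32
truth = rng.normal(size=dim)
truth /= np.linalg.norm(truth)              # assumed hidden "judgement" direction
labels = rng.integers(0, 2, size=n_pairs)   # 1 if the first option is preferred
signs = 2 * labels - 1
diffs = 3.0 * signs[:, None] * truth[None, :] + rng.normal(size=(n_pairs, dim))

mean, direction = fit_unsupervised_probe(diffs)
pred = probe_predict(diffs, mean, direction)
accuracy = sign_invariant_accuracy(pred, labels)
print(f"sign-invariant accuracy: {accuracy:.3f}")
```

No labels are used when fitting the direction; they appear only when scoring. A supervised variant would instead fit a linear classifier, such as logistic regression, on the same difference vectors.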

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Our method exploits the empirical result that LLMs' internal features of "belief" or "judgement" are correlated with linear directions in their embedding spaces. For Llama 3.1 70B evaluated on the MT-Bench dataset, we find the first principal component of the contrast pair differences of embedding vectors roughly classifies which model in a given example was preferred by a panel of human raters (left). Supervised or unsupervised classifying probes built on these embedding vectors are more aligned with human raters than prompting methods alone (right), and this result holds across different model families (Gemma 2, Llama 3.1) at different sizes (from 2B to 70B parameters).
  • Figure 2: Unsupervised probes, in all but one test case, outperform generation-based methods like direct-scoring and pairwise comparisons. Interestingly, within a given model family, unsupervised probing with a small model almost always outperforms prompting with much larger models. This highlights two related key findings: (1) the use of relatively large LLMs for LLM-as-a-Judge tasks may be computationally wasteful and (2) there may be significant capability "left on the table" with smaller LLMs for such tasks.
  • Figure 3: Supervised probes, in all cases, allow for a further improvement in alignment with human raters over unsupervised probes. We also test parameter-efficient and full finetuning of models in the Gemma 2 family, finding that supervised probes still outperform finetuned generation-based evaluators.
  • Figure 4: Performance of classifying probes and generation-based prompting for Llama 3.1 70B on the LLMBar dataset. All three methods suffer under adversarial prompting (non-bold subsets); however, both probing approaches remain significantly more robust than prompting.
  • Figure 5: Taking the example of Llama 3.1 70B, we find most supervised probes are dissimilar while most unsupervised probes are similar (up to sign), regardless of the varying tasks in each of the six datasets considered.
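The phrase "similar (up to sign)" in Figure 5 can be made concrete with a small sketch. The similarity measure used in the paper is not stated here, so the code below assumes absolute cosine similarity, which treats a probe direction and its negation as identical. The probe vectors are made-up toy values.

```python
import math


def abs_cosine(u, v):
    """Cosine similarity with the sign discarded, so u and -u count as identical."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(dot) / (norm_u * norm_v)


# Toy probe directions (hypothetical values, for illustration only).
probe_a = [1.0, 2.0, 0.0]
probe_b = [-2.0, -4.0, 0.0]  # same line as probe_a, opposite orientation
probe_c = [0.0, 0.0, 5.0]    # orthogonal to probe_a

sim_ab = abs_cosine(probe_a, probe_b)
sim_ac = abs_cosine(probe_a, probe_c)
print(sim_ab, sim_ac)
```

Under this measure, probe_a and probe_b are maximally similar despite pointing in opposite directions, while probe_a and probe_c share nothing.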