Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

James Wedgwood, Chhavi Yadav, Virginia Smith

TL;DR

The results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies, and that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions.

Abstract

Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them to those of human annotators. Our method both validates existing results, such as the tendency for LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across both general and domain-specific datasets, including biases toward responses that emphasize concreteness and empathy in approaching new situations, toward detail and formality in academic advice, and against legal guidance that promotes active steps like calling police and filing lawsuits. Our results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies.

Paper Structure (39 sections, 3 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: High-level overview of methodology. Embeddings and LLM judgments are generated from a paired preference dataset, differential features are extracted, and interpretations of these features are generated for further study.
  • Figure 2: A selection of Differential SAE features for the combined preference dataset, with interpretations and $\Delta$win-rate for human and LLM annotators.
  • Figure 3: Selected features for the legaladvice dataset, with interpretations and $\Delta$win-rate.
  • Figure 4: All Differential SAE features for the combined preference dataset, with interpretations and $\Delta$win-rate for human and LLM annotators.
  • Figure 5: All features for the askacademia dataset.
  • ...and 1 more figure
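
The captions above refer to differential features and a $\Delta$win-rate without defining either in this excerpt. The sketch below shows one plausible reading, not the authors' implementation. It assumes that a differential representation is the elementwise difference between the embeddings of the two responses in a pair, and that $\Delta$win-rate is the rate at which the response expressing a feature more strongly is preferred, minus the 0.5 chance baseline. The function names `differential_embeddings` and `delta_win_rate`, the sign encoding, and the toy labels are all hypothetical.

```python
import numpy as np


def differential_embeddings(emb_a, emb_b):
    """Elementwise difference between the embeddings of responses A and B.

    Assumption: this is only one way to build a 'differential' input;
    the excerpt does not specify the construction.
    """
    return np.asarray(emb_a, dtype=float) - np.asarray(emb_b, dtype=float)


def delta_win_rate(feature_sign, a_wins):
    """Win rate of the feature-favored response minus the 0.5 baseline.

    feature_sign: +1 if the feature is stronger in A, -1 if stronger in B,
                  0 if the feature is inactive for the pair (hypothetical encoding).
    a_wins:       True where the annotator (human or LLM) preferred A.
    Assumption: this definition of delta win-rate is illustrative only.
    """
    feature_sign = np.asarray(feature_sign)
    a_wins = np.asarray(a_wins, dtype=bool)
    active = feature_sign != 0
    if not active.any():
        return float("nan")  # feature never fires, so no win rate is defined
    # The favored response wins if A wins when the sign is +1,
    # or if B wins when the sign is -1.
    favored_wins = np.where(feature_sign[active] > 0, a_wins[active], ~a_wins[active])
    return float(favored_wins.mean() - 0.5)


# Toy labels, invented purely for illustration.
signs = [1, 1, -1, 0, -1, 1]
human_prefers_a = [True, False, False, True, True, True]
llm_prefers_a = [True, True, False, False, False, True]

print("human:", delta_win_rate(signs, human_prefers_a))
print("LLM:  ", delta_win_rate(signs, llm_prefers_a))
```

Under these assumptions, a feature whose value is near zero for human annotators but clearly positive for an LLM judge would mark a preference specific to the judge, which is the kind of human-versus-LLM comparison the figure captions describe.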