Automated Concept Discovery for LLM-as-a-Judge Preference Analysis

James Wedgwood; Chhavi Yadav; Virginia Smith

TL;DR

The results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies, and that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions.

Abstract

Large Language Models (LLMs) are increasingly used as scalable evaluators of model outputs, but their preference judgments exhibit systematic biases and can diverge from human evaluations. Prior work on LLM-as-a-judge has largely focused on a small, predefined set of hypothesized biases, leaving open the problem of automatically discovering unknown drivers of LLM preferences. We address this gap by studying several embedding-level concept extraction methods for analyzing LLM judge behavior. We compare these methods in terms of interpretability and predictiveness, finding that sparse autoencoder-based approaches recover substantially more interpretable preference features than alternatives while remaining competitive in predicting LLM decisions. Using over 27k paired responses from multiple human preference datasets and judgments from three LLMs, we analyze LLM judgments and compare them to those of human annotators. Our method both validates existing results, such as the tendency for LLMs to prefer refusal of sensitive requests at higher rates than humans, and uncovers new trends across both general and domain-specific datasets, including biases toward responses that emphasize concreteness and empathy in approaching new situations, toward detail and formality in academic advice, and against legal guidance that promotes active steps like calling police and filing lawsuits. Our results show that automated concept discovery enables systematic analysis of LLM judge preferences without predefined bias taxonomies.
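To make the human-versus-LLM comparison concrete, the following is a minimal sketch of how a per-feature win-rate gap could be computed from paired data. This excerpt does not give the paper's exact definition of Δwin-rate, so the definition used here is an assumption: among pairs where a feature distinguishes the two responses, the fraction in which the annotator preferred the response expressing the feature more strongly, minus a 0.5 chance baseline. The function name `delta_win_rate`, its parameters, and the toy data are all hypothetical.

```python
def delta_win_rate(diff_activations, prefers_a, threshold=0.0):
    """Assumed definition (not taken from the paper): win rate of the response
    that expresses the feature more strongly, minus the 0.5 chance baseline.

    diff_activations: per-pair feature activation of response A minus response B.
    prefers_a: per-pair booleans, True if the annotator chose response A.
    """
    wins = 0
    total = 0
    for diff, chose_a in zip(diff_activations, prefers_a):
        if abs(diff) <= threshold:
            continue  # the feature does not distinguish this pair
        feature_favors_a = diff > 0
        total += 1
        if feature_favors_a == chose_a:
            wins += 1
    if total == 0:
        return None  # feature never active on this dataset
    return wins / total - 0.5


# Toy data for one hypothetical feature across five response pairs.
diffs = [0.8, -0.3, 0.0, 0.5, -0.9]
human_choices = [True, False, True, False, False]
llm_choices = [True, False, False, True, False]

human_gap = delta_win_rate(diffs, human_choices)
llm_gap = delta_win_rate(diffs, llm_choices)
```

Under this assumed definition, a feature whose gap is much larger for the LLM judge than for human annotators would be a candidate for a preference that the LLM holds more strongly than humans do.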

Paper Structure (39 sections, 3 equations, 6 figures, 2 tables)

Introduction
Related Work
LLM-as-a-judge preference analysis.
LLM versus human preferences.
Concept extraction and SAEs.
Methodology
Data Preparation
Concept Extraction
Feature Interpretation
Results
Method Comparison
Differential SAE Feature Analysis
General analysis.
Domain-specific dataset analysis.
Conclusion
...and 24 more sections

Figures (6)

Figure 1: High-level overview of methodology. Embeddings and LLM judgments are generated from a paired preference dataset, differential features are extracted, and interpretations of these features are generated for further study.
Figure 2: A selection of Differential SAE features for the combined preference dataset, with interpretations and $\Delta$win-rate for human and LLM annotators.
Figure 3: Selected features for the legaladvice dataset, with interpretations and $\Delta$win-rate.
Figure 4: All Differential SAE features for the combined preference dataset, with interpretations and $\Delta$win-rate for human and LLM annotators.
Figure 5: All features for the askacademia dataset.
...and 1 more figure
