Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking

Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson

TL;DR

This work critically examines whether LLM-judge preferences reflect real progress on concrete alignment objectives. It introduces SOS-Bench, a large ground-truth meta-benchmark, and reveals that LLM judges are heavily biased toward stylistic factors, with data scaling during supervised fine-tuning and prompt diversity driving most alignment gains rather than preference optimization methods. Two-stage post-training can degrade world knowledge, while improvements on LLM-judge benchmarks often outpace gains on SOS-Bench, highlighting a misalignment between judge signals and real-world safety, knowledge, and instruction-following metrics. The authors urge precise, domain-focused benchmarks and a shift away from single-dimension judgments, promoting reproducible, scalable evaluation strategies like SOS-Bench to guide robust alignment progress.

Abstract

The release of ChatGPT in November 2022 sparked an explosion of interest in post-training and an avalanche of new preference optimization (PO) methods. These methods claim superior alignment by virtue of better correspondence with human pairwise preferences, often measured by LLM-judges. In this work, we attempt to answer the following question -- do LLM-judge preferences translate to progress on other, more concrete metrics for alignment, and if not, why not? We define a concrete metric for alignment, and introduce SOS-Bench (Substance Outweighs Style Benchmark), which is to the best of our knowledge the largest standardized, reproducible LLM meta-benchmark to date. We find that (1) LLM-judge preferences do not correlate with concrete measures of safety, world knowledge, and instruction following; (2) LLM-judges have powerful implicit biases, prioritizing style over factuality and safety; and (3) the supervised fine-tuning (SFT) stage of post-training, and not the PO stage, has the greatest impact on alignment, with data scaling and prompt diversity as the driving factors. Our codebase and complete results can be found at https://github.com/penfever/sos-bench.

Paper Structure (39 sections, 5 figures, 12 tables)

Figures (5)

  • Figure 1: The LLM-judge pipeline introduces new potential confounds in evaluation, compared to standard benchmarks. We diagram the LLM-judge pipeline for alignment benchmarking and observe that it is more complex than that of most standard benchmarks: (a) it replaces an explainable, deterministic metric with an opaque LLM-judge; (b) it does not attempt to establish any verifiable ground truth; (c) it contains a relatively small number of questions covering a very wide range of topics, resulting in limited coverage of any particular knowledge domain; and (d) it introduces novel confounds in the form of the judging template (explicit bias) and the judge's unstated internal preferences (implicit bias).
  • Figure 2: Judges implicitly reweight explicit criteria. When asked to render an overall judgment using a set of explicit criteria, models will implicitly weight some of those criteria more than others. We report the LLM's overall judgment as Arena-Hard Score, alongside independent LLM judgments of five key factors in the response. Style is perfectly correlated with the overall score (Pearson's R).
  • Figure 3: More is more in alignment. In the SFT stage of post-training, the size of the dataset, rather than the method used to curate the data, is the strongest predictor of alignment. We report average normalized accuracy on the y-axis and dataset size (in thousands) on the x-axis. The shaded region represents the 95% confidence interval.
  • Figure 4: More is more in alignment. ipso facto
  • Figure 5: Comparative impact of changes to the Arena-Hard-Auto methodology. The last row represents the strength of correlation between the ablation and the base (Pearson's R). We observe the highest sensitivity when changing the baseline model used for pairwise comparisons. Interestingly, changing the questions does not have a particularly strong effect on the rank order of models, suggesting that what LLM judges measure is not particularly attuned to subject matter.
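
Several captions above report Pearson's R, for example between an ablated Arena-Hard-Auto configuration and the base configuration. As a rough illustration of that statistic only, the sketch below computes Pearson's R between two lists of per-model scores. The function name `pearson_r` and all score values are hypothetical placeholders chosen for this example; they are not taken from the paper or from the SOS-Bench codebase.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical scores for five models (made-up numbers, not results from the paper).
base_scores = [10.0, 20.0, 30.0, 40.0, 50.0]      # e.g. base judge configuration
ablation_scores = [12.0, 18.0, 33.0, 41.0, 47.0]  # e.g. same models, ablated setup

r = pearson_r(base_scores, ablation_scores)
print(f"Pearson's R = {r:.3f}")
```

A value near 1 would indicate that the ablation leaves the models' scores in nearly the same linear relationship as the base setup, while a lower value would indicate that the change matters more. Pearson's R measures linear association, so a rank correlation may be a better fit when only model ordering is of interest.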