On Evaluating LLM Alignment by Evaluating LLMs as Judges

Yixin Liu, Pengfei Liu, Arman Cohan

TL;DR

This work defines generation-evaluation consistency (GE-consistency) to study the link between LLMs' generation quality and their ability to evaluate others' outputs. By using a strong preference oracle, the authors show high GE-consistency across models and tasks, enabling a benchmarking paradigm (AlignEval) that assesses alignment via evaluation capability rather than direct output evaluation. AlignEval, including variants with IFEval, achieves performance competitive with LLM-judge baselines while significantly reducing evaluation cost. The findings imply that evaluation skills are a meaningful indicator of alignment quality and open avenues for self-improvement and more scalable benchmarking, with caveats about the proxy nature of evaluation-based methods and potential adversarial risks.

Abstract

Alignment with human preferences is an important evaluation aspect of LLMs, requiring them to be helpful, honest, safe, and to precisely follow human instructions. Evaluating large language models' (LLMs) alignment typically involves directly assessing their open-ended responses, requiring human annotators or strong LLM judges. Conversely, LLMs themselves have also been extensively evaluated as judges for assessing alignment. In this work, we examine the relationship between LLMs' generation and evaluation capabilities in aligning with human preferences. To this end, we first conduct a comprehensive analysis of the generation-evaluation consistency (GE-consistency) among various LLMs, revealing a strong correlation between their generation and evaluation capabilities when evaluated by a strong LLM preference oracle. Utilizing this finding, we propose a benchmarking paradigm that measures LLM alignment with human preferences without directly evaluating their generated outputs, instead assessing LLMs in their role as evaluators. Our evaluation shows that our proposed benchmark, AlignEval, matches or surpasses widely used automatic LLM evaluation benchmarks, such as AlpacaEval and Arena-Hard, in capturing human preferences when ranking LLMs. Our study offers valuable insights into the connection between LLMs' generation and evaluation capabilities, and introduces a benchmark that assesses alignment without directly evaluating model outputs.
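
As a rough illustration of the GE-consistency idea described above, the following sketch computes Spearman's rank correlation between a generation score and an evaluation score for the same set of models. The model names and all numbers are hypothetical placeholders, not results from the paper, and the helper names (`average_ranks`, `spearman`) are introduced here only for illustration.

```python
from statistics import mean


def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    # Undefined (division by zero) if either score list is constant.
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores: generation win rate and evaluation agreement per model.
generation_score = {"model_a": 0.62, "model_b": 0.48, "model_c": 0.35, "model_d": 0.20}
evaluation_score = {"model_a": 0.71, "model_b": 0.55, "model_c": 0.58, "model_d": 0.30}

models = sorted(generation_score)
ge_consistency = spearman(
    [generation_score[m] for m in models],
    [evaluation_score[m] for m in models],
)
print(f"GE-consistency (Spearman's rho): {ge_consistency:.2f}")
```

In this toy example only `model_b` and `model_c` swap places between the two rankings, so the correlation stays high. A value near 1 would indicate that ranking models by evaluation capability gives nearly the same ordering as ranking them by generation quality.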

Paper Structure

This paper contains 18 sections, 1 equation, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Illustration of Generation-Evaluation Consistency (GE-consistency), where LLMs' generation and evaluation capability rankings are compared using a preference oracle.
  • Figure 2: Generation and evaluation performance of various LLMs with gpt-4o-2024-08-06 as the preference oracle. The X-axis shows the generation performance in terms of LLMs' win rates against the baseline system (GPT-4) evaluated by the preference oracle. The Y-axis shows the evaluation performance in terms of LLMs' agreement rate (Cohen's Kappa) with the preference oracle on filtered evaluation task instances.
  • Figure 3: The GE-consistency with different LLMs as the preference oracle. Spearman's correlation between the generation and evaluation capability rankings of LLMs under different preference oracles is shown on the Y-axis. The preference oracles are sorted in ascending order of their corresponding GE-consistency levels on the X-axis.
  • Figure 4: Prompt template for evaluating LLMs as Judges.
  • Figure 5: Generation and evaluation performance of various LLMs with gpt-4o-2024-08-06 as the preference oracle. The X-axis shows the generation performance in terms of LLMs' win rates against the baseline system (GPT-4) evaluated by the preference oracle. The Y-axis shows the evaluation performance in terms of LLMs' agreement rate (Cohen's Kappa) with the preference oracle on filtered evaluation task instances.
  • ...and 1 more figure
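
The two axes described in the Figure 2 caption can be sketched as follows: a win rate against a baseline as judged by the preference oracle, and Cohen's Kappa between a model's pairwise judgments and the oracle's. This is a minimal sketch under stated assumptions: the label encodings (`"model"`/`"baseline"`, `"A"`/`"B"`), the absence of ties, and all example data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def win_rate(oracle_preferences, candidate="model"):
    """Fraction of pairwise comparisons in which the oracle prefers the candidate."""
    if not oracle_preferences:
        raise ValueError("need at least one comparison")
    return sum(p == candidate for p in oracle_preferences) / len(oracle_preferences)


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical generation task: oracle verdicts for model output vs. baseline output.
oracle_verdicts = ["model", "baseline", "model", "model"]
generation_performance = win_rate(oracle_verdicts)

# Hypothetical evaluation task: which of two responses (A or B) each judge prefers.
oracle_choices = ["A", "A", "B", "B", "A", "B", "A", "B"]
model_choices = ["A", "A", "B", "A", "A", "B", "B", "B"]
evaluation_performance = cohens_kappa(oracle_choices, model_choices)

print(f"win rate: {generation_performance:.2f}, kappa: {evaluation_performance:.2f}")
```

Cohen's Kappa is used here rather than raw agreement because it discounts the agreement two judges would reach by chance; in the toy data the raw agreement is 0.75 but the chance-corrected value is lower.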