Aligning Black-box Language Models with Human Judgments

Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy

TL;DR

This work tackles the misalignment between large language models (LLMs) and human judgments in subjective, ordinal evaluation tasks. It introduces a black-box alignment framework that learns a simple linear mapping from the LLMs' output space to human labels using a small calibration set, without retraining the LLM or accessing logits. The method delivers substantial gains, achieving an average 142% improvement in agreement across 29 tasks and enabling smaller models to rival larger ones, with strong performance even in zero-shot and few-shot settings. It also demonstrates transferability of alignments across related tasks and shows that alignment can outperform or match inter-human agreement in several cases, underscoring its practical value for scalable, human-centered evaluation.

Abstract

Large language models (LLMs) are increasingly used as automated judges to evaluate recommendation systems, search engines, and other subjective tasks, where relying on human evaluators can be costly, time-consuming, and unscalable. LLMs offer an efficient solution for continuous, automated evaluation. However, since the systems that are built and improved with these judgments are ultimately designed for human use, it is crucial that LLM judgments align closely with human evaluators to ensure such systems remain human-centered. On the other hand, aligning LLM judgments with human evaluators is challenging due to individual variability and biases in human judgments. We propose a simple yet effective framework to align LLM judgments with individual human evaluators or their aggregated judgments, without retraining or fine-tuning the LLM. Our approach learns a linear mapping between the LLM's outputs and human judgments, achieving over 142% average improvement in agreement across 29 tasks with only a small number of calibration examples used for training. Notably, our method works in zero-shot and few-shot settings, exceeds inter-human agreement on four out of six tasks, and enables smaller LLMs to achieve performance comparable to that of larger models.
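The linear mapping described in the abstract can be illustrated with a minimal sketch. The exact formulation used in the paper is not given in this section, so the following assumes the simplest case: the LLM emits a single numeric rating per item, and the mapping is an ordinary least-squares fit with a slope and an intercept. The function names `fit_linear_alignment` and `apply_alignment`, and the rounding and clipping step, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def fit_linear_alignment(llm_scores, human_labels):
    """Fit human ~= a * llm + b by least squares on a small calibration set.

    Assumption: each LLM judgment is a single numeric rating. The paper's
    actual output representation may differ.
    """
    x = np.asarray(llm_scores, dtype=float)
    y = np.asarray(human_labels, dtype=float)
    # Design matrix with a column of ones for the intercept.
    design = np.column_stack([x, np.ones_like(x)])
    (a, b), *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(a), float(b)


def apply_alignment(llm_scores, a, b, lowest, highest):
    """Map new LLM ratings onto the human scale, then round and clip.

    Rounding to the nearest integer label and clipping to the valid range
    are assumptions made here so the result is a valid ordinal label.
    """
    x = np.asarray(llm_scores, dtype=float)
    aligned = np.rint(a * x + b)
    return np.clip(aligned, lowest, highest).astype(int)


# Hypothetical calibration data: the LLM only uses the top of a 0-6 scale,
# while the human ratings spread over a wider range.
a, b = fit_linear_alignment([4, 5, 6], [2, 4, 6])
aligned = apply_alignment([4, 5, 6, 2], a, b, lowest=0, highest=6)
```

Because only the LLM's emitted ratings are used, a fit of this kind needs neither logits nor access to model weights, which is the sense in which the approach is black-box.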

Paper Structure

This paper contains 15 sections, 4 equations, 3 figures, and 5 tables.

Figures (3)

  • Figure 1: Cumulative distribution function of response options for the judgment task of Freitag et al. (2021), before and after label alignment. The task is to grade translation quality on a rating scale from 0 to 6, with anchor points nonsense ($0$), some meaning preserved ($2$), most meaning preserved ($4$), and perfect ($6$). Panel (a) illustrates the different response styles of human and LLM judges and highlights that LLMs primarily use highly positive labels, in contrast to human evaluators. Panel (b) shows the same graph after aligning the LLM responses to the average human judgment using our approach, demonstrating that LLM judgments can be aligned to human ones.
  • Figure 2: Test accuracy on the Medical Safety (response type) dataset as the number of training examples per judgment category increases.
  • Figure 3: Example prompt (top) for one of the judgment tasks (medical-safety: response type), where {{ examples }} is included for in-context learning experiments only. The {{ examples }} placeholder is replaced with one example for each output label, following the example format (bottom) of the corresponding task.
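
The placeholder substitution described in the Figure 3 caption can be sketched as follows. The prompt wording, label names, and example format are not reproduced in this section, so every string below is a hypothetical stand-in, and `render_prompt` is an illustrative helper rather than the authors' code. The sketch uses plain string replacement; the paper may use a template engine instead.

```python
def render_prompt(template, examples_by_label=None,
                  example_format="Input: {text}\nLabel: {label}"):
    """Fill the {{ examples }} placeholder of a judgment prompt.

    examples_by_label maps each output label to one example text, giving
    one in-context example per label. When it is None or empty (the
    zero-shot case), the placeholder is replaced with an empty string.
    """
    if examples_by_label:
        blocks = [
            example_format.format(text=text, label=label)
            for label, text in examples_by_label.items()
        ]
        examples = "\n\n".join(blocks)
    else:
        examples = ""
    return template.replace("{{ examples }}", examples)


# Hypothetical template and labels, for illustration only.
template = "Classify the response type.\n\n{{ examples }}\n\nInput: <query>\nLabel:"
zero_shot = render_prompt(template)
few_shot = render_prompt(template, {"label_a": "first example text",
                                    "label_b": "second example text"})
```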