Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks

Rylan Schaeffer, Punit Singh Koura, Binh Tang, Ranjan Subramanian, Aaditya K Singh, Todor Mihaylov, Prajjwal Bhargava, Lovish Madaan, Niladri S. Chatterji, Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang

TL;DR

The paper investigates how automated NLP benchmarks relate to human evaluations of chat LMs. It scores four Chat Llama 2 models on 160 NLP benchmarks ($X_{\text{NLP}} \in \mathbb{R}^{160 \times 4}$) and 55 human evaluations ($X_{\text{Human}} \in \mathbb{R}^{55 \times 4}$), then assesses how well the former correlate with and predict the latter. It finds strong overall correlations between benchmarks and human judgments, but identifies notable anticorrelations for Adversarial Dishonesty, Adversarial Harmfulness, and Safety, along with uncorrelated signals for Language Assistance and Open QA. These patterns are analyzed through the low-rank structure of the 160×55 correlation matrix, whose decomposition $C = U \Sigma V^T$ has three nonzero singular values. The authors demonstrate that overparameterized linear regressions, evaluated with leave-one-out cross-validation, can predict average human evaluation scores from NLP benchmarks, suggesting that benchmarks can forecast user satisfaction across model scales, with caveats due to the small sample size and the linearity assumption. Overall, the work reinforces the value of classic NLP benchmarks for gauging real-world user experience while highlighting gaps in safety and open-ended tasks and charting a path for using benchmarks to reduce expensive human annotation.
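The correlation analysis summarized above can be illustrated with a minimal NumPy sketch. This is not the authors' code: the score matrices are random stand-ins with the stated shapes, and the helper name `pearson_correlation_matrix` is hypothetical. It computes the 160×55 matrix of Pearson correlations across the four models and inspects its singular values. With only four models, mean-centering each score vector leaves at most three degrees of freedom, so such a matrix can have at most three nonzero singular values, which is consistent with the count reported in the paper.

```python
import numpy as np


def pearson_correlation_matrix(x_nlp, x_human):
    """Pearson correlation of each row of x_nlp with each row of x_human.

    Both inputs have one row per evaluation and one column per model, so the
    correlation is taken across models.
    """
    def standardize(x):
        # Center each row, then scale it to unit Euclidean norm.
        centered = x - x.mean(axis=1, keepdims=True)
        return centered / np.linalg.norm(centered, axis=1, keepdims=True)

    return standardize(x_nlp) @ standardize(x_human).T


# Synthetic stand-ins (assumption): random scores, NOT the paper's data.
rng = np.random.default_rng(0)
x_nlp = rng.normal(size=(160, 4))    # 160 NLP benchmarks x 4 models
x_human = rng.normal(size=(55, 4))   # 55 human evaluations x 4 models

corr = pearson_correlation_matrix(x_nlp, x_human)  # shape (160, 55)

# Singular values of C = U Sigma V^T; count those clearly above round-off.
singular_values = np.linalg.svd(corr, compute_uv=False)
numerical_rank = int(np.sum(singular_values > 1e-10 * singular_values[0]))
print(corr.shape, numerical_rank)
```

Because the rank bound comes from the number of models rather than from the data, the same count would appear for any four generic models; the informative part of the decomposition is the direction of the singular vectors, not the number of nonzero singular values.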

Abstract

The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming and noisy human evaluations - yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. Three human evaluations, such as adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, while two are uncorrelated. Moreover, through overparameterized linear regressions, we show that NLP scores can accurately predict human evaluations across different model scales, offering a path to reduce costly human annotation without sacrificing rigor. Overall, our results affirm the continued value of classic benchmarks and illuminate how to harness them to anticipate real-world user satisfaction - pointing to how NLP benchmarks can be leveraged to meet evaluation needs of our new era of conversational AI.
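As a rough illustration of predicting a human evaluation score from NLP benchmark scores with an overparameterized linear regression, the sketch below fits a linear model on three models and predicts the held-out fourth, repeating this for each model. Several things here are assumptions rather than details from the paper: the minimum-norm least-squares solution (via `np.linalg.lstsq`) is used to resolve the underdetermined fit, the data are centered using training statistics, the function name `leave_one_out_predictions` is hypothetical, and the data are synthetic and noise-free.

```python
import numpy as np


def leave_one_out_predictions(features, targets):
    """Hold out each model in turn and predict its target from the rest.

    features: (n_models, n_benchmarks) NLP benchmark scores.
    targets:  (n_models,) human evaluation scores.
    With far more benchmarks than models the fit is underdetermined, so the
    minimum-norm least-squares solution is used (an assumption of this sketch).
    """
    n_models = features.shape[0]
    predictions = np.empty(n_models)
    for held_out in range(n_models):
        train = np.arange(n_models) != held_out
        # Center with training statistics only, so nothing leaks from the
        # held-out model into the fit.
        x_mean = features[train].mean(axis=0)
        y_mean = targets[train].mean()
        weights, *_ = np.linalg.lstsq(
            features[train] - x_mean, targets[train] - y_mean, rcond=1e-10
        )
        predictions[held_out] = (features[held_out] - x_mean) @ weights + y_mean
    return predictions


# Synthetic stand-in (assumption): one latent "capability" value per model
# drives both the benchmark scores and the human score. NOT the paper's data.
rng = np.random.default_rng(0)
latent = np.array([0.0, 1.0, 2.0, 3.0])
loadings = rng.normal(size=160)
features = np.outer(latent, loadings) + rng.normal(size=160)  # (4, 160)
targets = 0.5 * latent + 0.2

predictions = leave_one_out_predictions(features, targets)
print(np.abs(predictions - targets).max())
```

On this idealized data the held-out predictions match the targets up to round-off, because every score is an exact linear function of a single latent variable. Real benchmark scores are noisy, and with only four models each fit sees three training points, so held-out error on actual data would need to be interpreted with care.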

Paper Structure

This paper contains 18 sections, 1 equation, 18 figures.

Figures (18)

  • Figure 1: Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing (NLP) Benchmarks. We evaluate chat language models on conversational tasks with human pairwise evaluations and on standard NLP benchmarks with automated metrics, then study whether scores on computationally inexpensive and fast NLP benchmarks are correlated with and predictive of expensive and time-intensive human evaluations.
  • Figure 2: Distributions of Correlations between Human Evaluations and NLP benchmarks. Macroscopically, for each human evaluation area, Chat LM scores are typically highly correlated with NLP benchmarks. Mesoscopically, human evaluations and NLP benchmarks remain positively correlated, with notable exceptions: Adversarial Dishonesty, Adversarial Harmfulness and Safety are anticorrelated with most NLP benchmarks, and Language Assistance and Open QA are uncorrelated.
  • Figure 3: NLP Benchmarks Ranked by Average Pearson Correlation over All Human Evaluations. Certain benchmarks have higher correlations with human evaluations, including a subset of MMLU, a subset of BIG Bench Hard, HellaSwag, ARC, RACE, PIQA, NaturalQuestions, QuAC, and CommonSenseQA. Other benchmarks were weakly or uncorrelated with human evaluations: ETHOS, Kth Sentence, Inverse Scaling (with the exception of Resisting Correction Classification), OpenBookQA, COPA, SciBench and SIQA.
  • Figure 4: Pearson Correlations Between Human Evaluations and NLP Benchmarks. Rows: Human evaluation areas, categories and subcategories. Columns: NLP benchmarks. The heatmap is row-wrapped to fit on the page. Large positive correlations (+1) are shown in red. Large negative correlations (-1) are shown in blue. Correlations near zero ($\sim$0) are shown in white to light gray.
  • Figure 5: Matrix Decomposition of Pairwise Pearson Correlations Between Human Evaluations and NLP Benchmarks. The correlation matrix has 3 non-zero singular values (App. Fig. \ref{app:fig:academic_human_singular_value_spectra}). Bottom: Human evaluations and NLP benchmarks are plotted projected along the (dimension-scaled) first two singular modes of the Pearson correlation matrix. The bulk of evaluations live in one community (left), with smaller communities (top, bottom, right); for an in-depth interpretation, see Sec. \ref{sec:correlations:subsec:community_detection}.
  • ...and 13 more figures