Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer, Punit Singh Koura, Binh Tang, Ranjan Subramanian, Aaditya K Singh, Todor Mihaylov, Prajjwal Bhargava, Lovish Madaan, Niladri S. Chatterji, Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang
TL;DR
The paper investigates how automated NLP benchmarks relate to human evaluations of chat LMs. It scores four Chat Llama 2 models on 160 NLP benchmarks ($X_{\text{NLP}} \in \mathbb{R}^{160 \times 4}$) and 55 human evaluations ($X_{\text{Human}} \in \mathbb{R}^{55 \times 4}$), then measures how well the former correlate with and predict the latter. Most benchmarks correlate strongly with human judgments, but three human evaluations (Adversarial Dishonesty, Adversarial Harmfulness, and Safety) are anticorrelated with the benchmarks and two (Language Assistance and Open QA) are uncorrelated. These patterns are analyzed through the low-rank structure of the $160 \times 55$ correlation matrix $C$, whose decomposition $C = U \Sigma V^T$ has three nonzero singular values. The authors also show that overparameterized linear regressions, evaluated with leave-one-out cross-validation, can predict average human evaluation scores from NLP benchmark scores. This suggests that benchmarks can forecast user satisfaction across model scales, subject to caveats about the small sample of models and the linearity assumption. Overall, the work reinforces the value of classic NLP benchmarks for gauging real-world user experience, highlights gaps in safety and open-ended tasks, and charts a path for using benchmarks to reduce expensive human annotation.
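To make the correlation-matrix step concrete, here is a minimal NumPy sketch. It assumes Pearson correlation computed across the four models, which the summary above does not specify, and it uses random placeholder scores instead of the paper's data. The names `standardize_rows`, `X_nlp`, and `X_human` are illustrative only. Under these assumptions, each row is centered across four models and so lies in a three-dimensional subspace, which is one way to see why $C$ can have at most three nonzero singular values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder scores (NOT the paper's data): rows are benchmarks or human
# evaluations, columns are the four models.
X_nlp = rng.random((160, 4))
X_human = rng.random((55, 4))


def standardize_rows(X):
    """Center each row across models and scale it to unit norm, so that
    dot products between rows equal Pearson correlations."""
    Z = X - X.mean(axis=1, keepdims=True)
    return Z / np.linalg.norm(Z, axis=1, keepdims=True)


Z_nlp = standardize_rows(X_nlp)
Z_human = standardize_rows(X_human)

# C[i, j] is the Pearson correlation between benchmark i and human evaluation j.
C = Z_nlp @ Z_human.T  # shape (160, 55)

# Thin SVD: C = U @ diag(S) @ Vt.
U, S, Vt = np.linalg.svd(C, full_matrices=False)

# Count singular values that are not numerically zero.
rank = int(np.sum(S > 1e-10 * S[0]))
print("shape of C:", C.shape)
print("numerical rank of C:", rank)
```

With a rank-based correlation (for example Spearman), the same centering argument applies to the ranks, but that is again an assumption about the setup, not something stated in the summary.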
Abstract
The explosion of high-performing conversational language models (LMs) has spurred a shift from classic natural language processing (NLP) benchmarks to expensive, time-consuming and noisy human evaluations - yet the relationship between these two evaluation strategies remains hazy. In this paper, we conduct a large-scale study of four Chat Llama 2 models, comparing their performance on 160 standard NLP benchmarks (e.g., MMLU, ARC, BIG-Bench Hard) against extensive human preferences on more than 11k single-turn and 2k multi-turn dialogues from over 2k human annotators. Our findings are striking: most NLP benchmarks strongly correlate with human evaluations, suggesting that cheaper, automated metrics can serve as surprisingly reliable predictors of human preferences. Three human evaluations, such as adversarial dishonesty and safety, are anticorrelated with NLP benchmarks, while two are uncorrelated. Moreover, through overparameterized linear regressions, we show that NLP scores can accurately predict human evaluations across different model scales, offering a path to reduce costly human annotation without sacrificing rigor. Overall, our results affirm the continued value of classic benchmarks and illuminate how to harness them to anticipate real-world user satisfaction - pointing to how NLP benchmarks can be leveraged to meet evaluation needs of our new era of conversational AI.
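The prediction setup described above (many more benchmarks than models, evaluated by holding out one model at a time) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are random placeholders, the regression is taken to be the minimum-norm least-squares fit with an intercept handled by centering, and the helper names `fit_min_norm` and `loo_predictions` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data (NOT the paper's): one row per model, one column per benchmark.
n_models, n_benchmarks = 4, 160
X = rng.random((n_models, n_benchmarks))
y = rng.random(n_models)  # placeholder average human evaluation score per model


def fit_min_norm(X_train, y_train):
    """Fit an overparameterized linear regression. With more features than
    samples, np.linalg.lstsq returns the minimum-norm least-squares solution."""
    mu_x, mu_y = X_train.mean(axis=0), y_train.mean()
    w, *_ = np.linalg.lstsq(X_train - mu_x, y_train - mu_y, rcond=None)
    return w, mu_x, mu_y


def loo_predictions(X, y):
    """Leave-one-out cross-validation: fit on all models but one,
    then predict the held-out model's score."""
    preds = np.empty(len(y))
    for i in range(len(y)):
        train = np.arange(len(y)) != i
        w, mu_x, mu_y = fit_min_norm(X[train], y[train])
        preds[i] = (X[i] - mu_x) @ w + mu_y
    return preds


preds = loo_predictions(X, y)
loo_mse = float(np.mean((preds - y) ** 2))
print("held-out predictions:", preds)
print("leave-one-out MSE:", loo_mse)
```

Because the placeholder targets are random, the held-out error here says nothing about the paper's results; the sketch only shows the mechanics of fitting on three models and predicting the fourth.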
