
What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks

Pavel Chizhov, Mattia Nee, Pierre-Carl Langlais, Ivan P. Yamshchikov

TL;DR

The paper questions whether HellaSwag validly measures common-sense reasoning by uncovering pervasive quality and design issues, including ungrammatical prompts, nonsensical or mutually implausible options, and a surprising insensitivity to the question context. It combines human annotations with model evaluations in log-likelihood and generation modes and employs zero-prompt tests and placeholder prompts to assess construct validity, revealing that many predictions rely on prompt-irrelevant cues rather than reasoning. The authors show that extended accuracy and zero-prompt performance undermine the claim that HellaSwag captures true common-sense ability, and they introduce GoldenSwag, a highly filtered subset, to enable more reliable evaluation. The work advocates a set of rigorous requirements for future benchmarks and provides GoldenSwag as a concrete step toward more trustworthy measurement of common-sense reasoning in language models.

Abstract

Common-sense reasoning is a key language model capability because it encapsulates not just specific factual knowledge but general language and world understanding. Measuring common-sense reasoning, therefore, is crucial for language models of different sizes and applications. One of the most widely used benchmarks for evaluating such capabilities is HellaSwag; however, in this paper, we show that it has severe construct validity issues. These issues range from basic ungrammaticality and numerous typos to misleading prompts or equally correct options. Furthermore, we show that if models are evaluated only on answer texts, or with "Lorem ipsum dolor..." instead of the question, more than 65% of model predictions remain the same, and this cannot be attributed merely to contamination. Since benchmark scores are an essential part of model selection in both research and commercial applications, these validity issues can have severe consequences. In particular, because benchmark scores are so often taken at face value, inadequate evaluation leads to ill-informed decisions about models. In this paper, we thoroughly investigate critical validity issues posed by HellaSwag and illustrate them with various evaluations using generative language models of different sizes. We argue that this benchmark does not accurately measure common-sense reasoning and, therefore, should not be used for evaluation in its current state. Based on the results of our study, we propose requirements that should be met by future common-sense reasoning benchmarks. In addition, we release GoldenSwag, a corrected subset of HellaSwag, which, we believe, facilitates acceptable common-sense reasoning evaluation.
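To make the zero-prompt comparison concrete, the following is a minimal sketch of how the share of unchanged predictions could be computed. It assumes each answer option is scored by its mean per-token log-likelihood and that the highest-scoring option is the prediction. The function names and all numbers are hypothetical illustrations, not the authors' code or results.

```python
from typing import Sequence

# One question = a list of options; one option = its per-token log-probabilities.
Question = Sequence[Sequence[float]]


def mean_loglik(token_logprobs: Sequence[float]) -> float:
    """Length-normalised score: average log-probability per token."""
    return sum(token_logprobs) / len(token_logprobs)


def predict(question: Question) -> int:
    """Index of the option with the highest mean log-likelihood."""
    scores = [mean_loglik(option) for option in question]
    return max(range(len(scores)), key=scores.__getitem__)


def agreement_rate(full: Sequence[Question], zero: Sequence[Question]) -> float:
    """Fraction of questions whose prediction is identical in both settings."""
    same = sum(predict(f) == predict(z) for f, z in zip(full, zero))
    return same / len(full)


# Made-up log-probabilities for three questions with four options each.
full_prompt = [
    [[-1.0, -2.0], [-0.5, -0.5, -0.5], [-3.0], [-2.0, -2.5]],
    [[-0.2, -0.4], [-1.0], [-2.0, -2.0], [-1.5]],
    [[-2.0], [-1.8, -2.2], [-0.9, -1.1], [-3.0]],
]
# Same options scored without the question prompt (hypothetical values).
zero_prompt = [
    [[-2.0, -2.0], [-1.0, -1.0, -1.0], [-3.5], [-2.5, -2.5]],
    [[-1.2, -1.4], [-1.0], [-2.5, -2.5], [-1.1]],
    [[-2.5], [-2.0, -2.4], [-1.4, -1.6], [-3.0]],
]

rate = agreement_rate(full_prompt, zero_prompt)
print(f"Predictions unchanged without the prompt: {rate:.0%}")
```

In this toy data, two of the three predictions survive removal of the prompt. A high agreement rate on real data would suggest that the options alone, rather than the question context, drive the model's choice.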

Paper Structure

This paper contains 23 sections, 3 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Example from the HellaSwag validation set: a question prompt and four answer options (the correct one is ticked). Validity issues are underlined in color. The part of the question prompt that we remove in zero-prompt experiments is highlighted in light gray.
  • Figure 2: Accuracy evaluation with log-likelihood for larger (32B) and smaller (1-2B) models with a full question prompt and without it (zero-prompt). For each model, we report three accuracy scores: usual accuracy on correct answers, extended accuracy on correct and equally correct options, and accuracy on the worst option. Important: the ground truths for the last two kinds of evaluation are based on the annotations from Claude.
  • Figure 3: Mean log-likelihoods of answer options and their lengths. The lines represent the trend for each question source.
  • Figure 4: Accuracy comparison between HellaSwag and GoldenSwag. All models were evaluated by maximizing mean log-likelihood in full- and zero-prompt scenarios.
  • Figure 5: Joint distributions of likelihoods and lengths of the HellaSwag validation set questions for different variants of likelihood computation. The lines are meant only to show the trend; we do not assume that linear regression is significant for these data.
  • ...and 2 more figures
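
The three scores named in the Figure 2 caption (usual accuracy, extended accuracy, and accuracy on the worst option) can be illustrated with a small sketch. The `Item` record and its fields are hypothetical, as are the toy annotations; the paper's own data format and annotation pipeline may differ.

```python
from dataclasses import dataclass, field
from typing import Sequence, Set


@dataclass
class Item:
    """One evaluated question (hypothetical record layout)."""
    predicted: int                      # option index chosen by the model
    correct: int                        # gold label from the benchmark
    equally_correct: Set[int] = field(default_factory=set)  # annotated alternatives
    worst: int = -1                     # annotated least plausible option


def accuracy(items: Sequence[Item]) -> float:
    """Usual accuracy: the prediction matches the gold label."""
    return sum(i.predicted == i.correct for i in items) / len(items)


def extended_accuracy(items: Sequence[Item]) -> float:
    """Prediction matches the gold label or an option annotated as equally correct."""
    hits = sum(
        i.predicted == i.correct or i.predicted in i.equally_correct
        for i in items
    )
    return hits / len(items)


def worst_option_accuracy(items: Sequence[Item]) -> float:
    """How often the model picks the option annotated as the worst one."""
    return sum(i.predicted == i.worst for i in items) / len(items)


# Made-up annotations for four questions.
items = [
    Item(predicted=1, correct=1, worst=3),
    Item(predicted=2, correct=0, equally_correct={2}, worst=1),
    Item(predicted=3, correct=0, worst=3),
    Item(predicted=0, correct=0, equally_correct={1}, worst=2),
]

print(accuracy(items), extended_accuracy(items), worst_option_accuracy(items))
```

Extended accuracy can never be lower than usual accuracy, since every exact match also counts as an extended match; the gap between the two indicates how often a model is penalised for choosing an option that annotators judged equally valid.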