HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning

Xiaoyuan Li, Moxin Li, Rui Men, Yichang Zhang, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu, Junyang Lin

TL;DR

This work asks whether large language models truly understand commonsense reasoning by evaluating their robustness to variant question forms across languages. It introduces HellaSwag-Pro, a large-scale bilingual benchmark with seven variant types (11,200 items) and a Chinese counterpart to enable cross-lingual robustness analysis. Through extensive experiments on 41 LLMs with nine prompting strategies, the authors find widespread robustness gaps: performance degrades on variants and depends strongly on language and prompting style, while chain-of-thought reasoning and few-shot demonstrations can mitigate some weaknesses. The dataset, methodology, and findings offer practical guidance for evaluating and advancing commonsense reasoning in LLMs and highlight areas where current models still fall short of genuine understanding.

Abstract

Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations in questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or just memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community in commonsense reasoning for LLMs.

Paper Structure

This paper contains 36 sections, 4 equations, 11 figures, 20 tables.

Figures (11)

  • Figure 1: Comparison of GPT-4o's responses to an original question and its several meaning-preserving variants. GPT-4o successfully handles the original question but struggles with its variants on the same knowledge.
  • Figure 2: The two-stage data construction pipeline for Chinese HellaSwag. See an example in the table labeled "case".
  • Figure 3: Overview of Chinese HellaSwag categories. There are seven broad categories in total, each with eight detailed subcategories.
  • Figure 4: Pairwise performance statistics of the original question and its variant. We use "HellaSwag ✓ HellaSwag-Pro ✗" to denote that the LLM correctly answers the original question but fails on its variant.
  • Figure 5: Each variant's contribution to the RLA score.
  • ...and 6 more figures
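
The pairwise statistics described in the Figure 4 caption can be illustrated with a small tally. This is a minimal sketch, not the authors' evaluation code: the function name `pairwise_stats`, its inputs (one correctness flag per original question and one per matching variant), and the toy data are all assumptions made for illustration. Only the four outcome labels follow the notation in the caption.

```python
def pairwise_stats(original_correct, variant_correct):
    """Return the share of question pairs in each of the four joint outcomes.

    original_correct[i] -- hypothetical flag: did the model answer original question i correctly?
    variant_correct[i]  -- hypothetical flag: did it answer the matching variant correctly?
    """
    if len(original_correct) != len(variant_correct):
        raise ValueError("both lists need one entry per question pair")

    mark = {True: "✓", False: "✗"}
    # Start every outcome at zero so absent outcomes still appear in the result.
    counts = {
        f"HellaSwag {mark[o]} HellaSwag-Pro {mark[v]}": 0
        for o in (True, False)
        for v in (True, False)
    }
    for o, v in zip(original_correct, variant_correct):
        counts[f"HellaSwag {mark[bool(o)]} HellaSwag-Pro {mark[bool(v)]}"] += 1

    total = len(original_correct)
    return {label: (n / total if total else 0.0) for label, n in counts.items()}


if __name__ == "__main__":
    # Toy data invented for this example; not results from the paper.
    original = [True, True, False, True]
    variant = [True, False, False, False]
    for label, share in pairwise_stats(original, variant).items():
        print(f"{label}: {share:.2f}")
```

On the toy data, half of the pairs fall into "HellaSwag ✓ HellaSwag-Pro ✗", the outcome the caption singles out: the model answers the original question correctly but fails on its variant.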