HellaSwag-Pro: A Large-Scale Bilingual Benchmark for Evaluating the Robustness of LLMs in Commonsense Reasoning

Xiaoyuan Li; Moxin Li; Rui Men; Yichang Zhang; Keqin Bao; Wenjie Wang; Fuli Feng; Dayiheng Liu; Junyang Lin

TL;DR

This work asks whether large language models truly understand commonsense knowledge by evaluating their robustness to variant question forms across languages. It introduces HellaSwag-Pro, a large-scale bilingual benchmark of 11,200 items covering seven variant types, together with a Chinese counterpart of HellaSwag that enables cross-lingual robustness analysis. Through extensive experiments on 41 LLMs with nine prompting strategies, the authors find widespread robustness gaps: performance degrades on the variants and depends strongly on language and prompting style, while chain-of-thought reasoning and few-shot demonstrations can mitigate some weaknesses. The dataset, methodology, and findings offer practical guidance for evaluating and advancing commonsense reasoning in LLMs, and they highlight areas where current models still fall short of genuine understanding.

Abstract

Large language models (LLMs) have shown remarkable capabilities in commonsense reasoning; however, some variations in questions can trigger incorrect responses. Do these models truly understand commonsense knowledge, or just memorize expression patterns? To investigate this question, we present the first extensive robustness evaluation of LLMs in commonsense reasoning. We introduce HellaSwag-Pro, a large-scale bilingual benchmark consisting of 11,200 cases, by designing and compiling seven types of question variants. To construct this benchmark, we propose a two-stage method to develop Chinese HellaSwag, a finely annotated dataset comprising 12,000 instances across 56 categories. We conduct extensive experiments on 41 representative LLMs, revealing that these LLMs are far from robust in commonsense reasoning. Furthermore, this robustness varies depending on the language in which the LLM is tested. This work establishes a high-quality evaluation benchmark, with extensive experiments offering valuable insights to the community in commonsense reasoning for LLMs.
