RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models

Yuqing Wang; Yun Zhao

TL;DR

RUPBench addresses the critical need to evaluate LLM robustness across diverse reasoning tasks under realistic input variations. It assembles 15 reasoning datasets spanning commonsense, arithmetic, logical, and knowledge-intensive domains and subjects them to nine perturbation types across lexical, syntactic, and semantic dimensions, totaling 365,580 perturbed samples. The study evaluates GPT-4o, Llama3, Phi-3, and Gemma models, revealing that larger models generally exhibit stronger robustness while identifying common error patterns such as context misinterpretation and knowledge gaps. The benchmark provides granular insights into perturbation sensitivity, guiding targeted improvements for reliable real-world LLM deployments and offering a scalable framework for future robustness research.

Abstract

With the increasing use of large language models (LLMs), ensuring reliable performance in diverse, real-world environments is essential. Despite their remarkable achievements, LLMs often struggle with adversarial inputs, significantly impacting their effectiveness in practical applications. To systematically understand the robustness of LLMs, we present RUPBench, a comprehensive benchmark designed to evaluate LLM robustness across diverse reasoning tasks. Our benchmark incorporates 15 reasoning datasets, categorized into commonsense, arithmetic, logical, and knowledge-intensive reasoning, and introduces nine types of textual perturbations at lexical, syntactic, and semantic levels. By examining the performance of state-of-the-art LLMs such as GPT-4o, Llama3, Phi-3, and Gemma on both original and perturbed datasets, we provide a detailed analysis of their robustness and error patterns. Our findings highlight that larger models tend to exhibit greater robustness to perturbations. Additionally, common error types are identified through manual inspection, revealing specific challenges faced by LLMs in different reasoning contexts. This work provides insights into areas where LLMs need further improvement to handle diverse and noisy inputs effectively.

Paper Structure (26 sections, 3 figures, 4 tables)

Introduction
Related Work
LLM Evaluation
Textual Perturbations and LLM Safety
Dataset Construction
Tasks and datasets
Commonsense Reasoning
Arithmetic Reasoning
Logical Reasoning
Knowledge-Intensive Reasoning
Perturbation Categories
Lexical Perturbation
Syntactic Perturbation
Semantic Perturbation
Expert Review
...and 11 more sections

Figures (3)

Figure 1: Overview of the data construction pipeline for RUPBench.
Figure 2: Normalized PDR (%) of nine perturbation types, averaged across datasets and models. Normalization scales each perturbation's impact.
Figure 3: Average PDR (%) by dataset categories and models. Each bar represents the average PDR for a specific model across different dataset categories. Commonsense reasoning and arithmetic reasoning are generally more susceptible to perturbations. Additionally, larger models tend to be more robust to perturbations.
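
The figure captions report PDR without defining it in this excerpt. The sketch below assumes PDR denotes the relative drop in accuracy when moving from original to perturbed inputs, expressed as a percentage; that definition, the function names, and the sample accuracies are illustrative assumptions, not taken from the paper.

```python
from typing import Iterable, Tuple


def performance_drop_rate(acc_original: float, acc_perturbed: float) -> float:
    """Relative accuracy drop in percent (assumed meaning of PDR)."""
    if acc_original <= 0:
        raise ValueError("original accuracy must be positive")
    return 100.0 * (acc_original - acc_perturbed) / acc_original


def average_pdr(results: Iterable[Tuple[float, float]]) -> float:
    """Mean PDR over (original, perturbed) accuracy pairs."""
    drops = [performance_drop_rate(orig, pert) for orig, pert in results]
    if not drops:
        raise ValueError("at least one accuracy pair is required")
    return sum(drops) / len(drops)


# Hypothetical accuracies for one model on two datasets, not real results.
example_pairs = [(0.80, 0.60), (0.50, 0.50)]
example_average = average_pdr(example_pairs)
```

Under this assumed definition, a value of 0 means the perturbation left accuracy unchanged, and averaging per-pair values gives one bar per model and dataset category in the style of Figure 3.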
