Same Meaning, Different Scores: Lexical and Syntactic Sensitivity in LLM Evaluation

Bogdan Kostić, Conor Fallon, Julian Risch, Alexander Löser

TL;DR

This work probes the robustness of LLM evaluation by applying meaning-preserving lexical and syntactic perturbations to three benchmarks (MMLU, SQuAD, AMEGA) and evaluating 23 models on them. Using two linguistically principled pipelines, the study finds that lexical changes produce substantial performance drops while syntactic changes yield more variable effects, and that leaderboards are brittle under these perturbations, particularly on complex tasks. Model size does not guarantee robustness: the relationship between scale and stability is task-dependent. The findings indicate that LLM evaluations should incorporate robustness testing to avoid overestimating generalization and to guide more reliable benchmark design.

Abstract

The rapid advancement of Large Language Models (LLMs) has established standardized evaluation benchmarks as the primary instrument for model comparison. Yet, their reliability is increasingly questioned due to sensitivity to shallow variations in input prompts. This paper examines how controlled, truth-conditionally equivalent lexical and syntactic perturbations affect the absolute performance and relative ranking of 23 contemporary LLMs across three benchmarks: MMLU, SQuAD, and AMEGA. We employ two linguistically principled pipelines to generate meaning-preserving variations: one performing synonym substitution for lexical changes, and another using dependency parsing to determine applicable syntactic transformations. Results show that lexical perturbations consistently induce substantial, statistically significant performance degradation across nearly all models and tasks, while syntactic perturbations have more heterogeneous effects, occasionally improving results. Both perturbation types destabilize model leaderboards on complex tasks. Furthermore, model robustness did not consistently scale with model size, revealing strong task dependence. Overall, the findings suggest that LLMs rely more on surface-level lexical patterns than on abstract linguistic competence, underscoring the need for robustness testing as a standard component of LLM evaluation.

Paper Structure (19 sections, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Illustration of the two linguistically principled, meaning-preserving perturbation pipelines and their average impact on LLM performance on the MMLU benchmark. Yellow and blue annotations mark changed words and moved constituents, respectively. The bar chart quantifies one of the core findings: lexical perturbations induce a substantial average accuracy drop, while the impact of syntactic perturbations is smaller.
  • Figure 2: Example from MMLU: original item (left), lexically perturbed version (center), and syntactically perturbed version (right). Changed words are marked in yellow and moved constituents in blue.
  • Figure 3: Average drop in performance after lexical and syntactic perturbation across 23 LLMs for MMLU, SQuAD, and AMEGA. Lexical perturbations cause larger drops, most notably on MMLU.
  • Figure 4: Model performance rankings before and after lexical perturbation (top) and syntactic perturbation (bottom) for (a) MMLU, (b) SQuAD, and (c) AMEGA. Rankings are largely preserved on MMLU, while SQuAD and AMEGA show noticeably more movement.
  • Figure 5: Correlation between log-transformed model size and performance drop on lexically and syntactically perturbed benchmarks (MMLU, SQuAD, and AMEGA). The dashed line in each plot illustrates the Ordinary Least Squares regression fit for the points. Model size correlates positively with performance drop on MMLU and negatively on SQuAD and AMEGA.
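The analysis described for Figure 5 (an Ordinary Least Squares fit of performance drop against log-transformed model size) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `ols_fit`, the choice of base-10 logarithm, and the data points are all assumptions made for the example and are not taken from the paper.

```python
import math


def ols_fit(x, y):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns (slope, intercept, pearson_r). Hypothetical helper for illustration.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Pearson correlation; defined as 0.0 here when y is constant.
    r = sxy / math.sqrt(sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r


# Made-up data: model sizes in billions of parameters and the
# accuracy drop (percentage points) after perturbation.
sizes_b = [1, 10, 100]
drops = [2.0, 4.0, 6.0]

# Assumption: base-10 log; the source only says "log-transformed".
log_sizes = [math.log10(s) for s in sizes_b]
slope, intercept, r = ols_fit(log_sizes, drops)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, r={r:.2f}")
```

With this sign convention, a positive slope means larger models lose more accuracy under perturbation (the pattern the caption reports for MMLU), while a negative slope means larger models are more robust (the pattern reported for SQuAD and AMEGA).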