On the Brittleness of LLMs: A Journey around Set Membership

Lea Hergert, Gábor Berend, Mario Szegedy, Gyorgy Turan, Márk Jelasity

TL;DR

This paper probes the brittleness of instruction-tuned LLMs by grounding evaluation in the simplest form of reasoning: set membership. It runs a large-scale, controlled study across diverse prompts, set structures, and models to map error patterns and show how prompt phrasing, element ordering, and semantic relations distort reasoning. Key findings include prompt sensitivity, permutation sensitivity, and semantic leakage/boosting, with substantial variation in behavior across models. The work offers a scalable benchmark methodology for LLM evaluation and highlights fundamental gaps in LLMs' grasp of basic set concepts, informing future robustness and interpretability research.

Abstract

Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries, among the most fundamental forms of reasoning, using tasks like "Is apple an element of the set {pear, plum, apple, raspberry}?". We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, suggesting that the models' "understanding" of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
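
To make the query format concrete, the following minimal sketch builds the example question from the abstract for every ordering of a four-element set. The function name `build_queries` and the template string are hypothetical, chosen only for illustration; the paper's actual prompt templates are not reproduced here.

```python
from itertools import permutations


def build_queries(element, members,
                  template="Is {element} an element of the set {{{items}}}?"):
    """Return one prompt per ordering of `members`.

    The template is an illustrative assumption, modeled on the example
    question quoted in the abstract.
    """
    return [
        template.format(element=element, items=", ".join(order))
        for order in permutations(members)
    ]


members = ["pear", "plum", "apple", "raspberry"]
queries = build_queries("apple", members)

# The ground-truth answer does not depend on the ordering.
expected_answer = "apple" in set(members)

print(len(queries))   # 4! = 24 orderings
print(queries[0])
```

Because the correct answer is identical for all orderings, any variation in a model's responses across these prompts can only come from the ordering itself.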

Paper Structure

This paper contains 17 sections, 3 equations, 16 figures, 2 tables.

Figures (16)

  • Figure 1: Cosine similarity between the average query error-rates of prompt categories for all the LLMs (top row). The bottom row shows the cosine similarity between sub-classes of NL1 and NL2, where the division is based on the arrangement feature (see \ref{sec:prompts}). NL1.1 and NL2.1 use the element-first arrangement; the others use the set-first arrangement.
  • Figure 2: Top 6 Shapley values among binary features of the NL1 prompt category. The features are defined in \ref{sec:prompts}; non-binary features are converted to a one-hot representation. A dot corresponds to a prompt template and the color red indicates that the feature is present.
  • Figure 3: The fraction of order-independent queries where the answer of the LLM is not consistently correct for all the permutations of the set, by prompt type and set type. Values higher than 0.0 indicate the presence of sensitivity to ordering. The black region of a bar corresponds to the fraction of consistently incorrect order-independent queries.
  • Figure 4: The average accuracy of queries with different set-types, for all the LLMs, in the NL1 (top) and CS (bottom) prompt category. Every point belongs to a fixed prompt template and represents the average of the queries that belong to the given set type. For example, for the complete set type, every point is the average accuracy of $10\cdot (24\cdot 4+6\cdot 4+6\cdot 4)$ prompts.
  • Figure 5: Scatter plots illustrating the relationship between unrelated and related word sets in the NL1 prompt category (top) and number sets in the CS prompt category (bottom). Each point corresponds to one prompt template; its two coordinates are the average accuracies over queries with related and unrelated sets of the given type.
  • ...and 11 more figures
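
The quantity described in the Figure 3 caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that one query is a fixed element and set evaluated over all orderings of the set, and that each evaluation is recorded as a boolean (correct or not). The function name `ordering_sensitivity` and the sample data are hypothetical.

```python
def ordering_sensitivity(results):
    """Summarize per-query correctness across set permutations.

    `results` maps a query id to a list of booleans, one per permutation
    (True = the model answered correctly). This data layout is an
    assumption made for illustration.

    Returns a pair:
      - fraction of queries NOT answered correctly for every permutation
        (the bar height in the caption's description),
      - fraction of queries answered incorrectly for every permutation
        (the black region in the caption's description).
    """
    total = len(results)
    if total == 0:
        raise ValueError("results must contain at least one query")
    not_consistently_correct = sum(1 for r in results.values() if not all(r))
    consistently_incorrect = sum(1 for r in results.values() if not any(r))
    return not_consistently_correct / total, consistently_incorrect / total


# Made-up outcomes for four queries, three permutations each.
sample = {
    "q1": [True, True, True],     # consistently correct
    "q2": [True, False, True],    # depends on the ordering
    "q3": [False, False, False],  # consistently incorrect
    "q4": [True, True, True],     # consistently correct
}

bar_height, black_region = ordering_sensitivity(sample)
print(bar_height, black_region)  # 0.5 0.25
```

Under this reading, the difference between the two values (here 0.25) is the share of queries whose correctness changes with the ordering alone.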