On the Brittleness of LLMs: A Journey around Set Membership
Lea Hergert, Gábor Berend, Mario Szegedy, Gyorgy Turan, Márk Jelasity
TL;DR
This paper probes the brittleness of instruction-tuned LLMs by grounding evaluation in one of the simplest forms of reasoning: set membership. It conducts a large-scale, controlled study across diverse prompts, set structures, and models to map error patterns and reveal how prompt phrasing, element ordering, and semantic relations distort reasoning. Key findings include prompt sensitivity, permutation sensitivity, and semantic leakage/boosting, with substantial variation in behavior across models. The work offers a scalable benchmark methodology for LLM evaluation and highlights fundamental gaps in LLMs' grasp of basic set concepts, informing future robustness and interpretability research.
Abstract
Large language models (LLMs) achieve superhuman performance on complex reasoning tasks, yet often fail on much simpler problems, raising concerns about their reliability and interpretability. We investigate this paradox through a focused study with two key design features: simplicity, to expose basic failure modes, and scale, to enable comprehensive controlled experiments. We focus on set membership queries, among the most fundamental forms of reasoning, using tasks like "Is apple an element of the set {pear, plum, apple, raspberry}?". We conduct a systematic empirical evaluation across prompt phrasing, semantic structure, element ordering, and model choice. Our large-scale analysis reveals that LLM performance on this elementary task is consistently brittle and unpredictable across all dimensions, suggesting that the models' "understanding" of the set concept is fragmented and convoluted at best. Our work demonstrates that the large-scale experiments enabled by the simplicity of the problem allow us to map and analyze the failure modes comprehensively, making this approach a valuable methodology for LLM evaluation in general.
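To make the evaluation setup concrete, the following is a minimal illustrative sketch of how set membership queries could be enumerated across prompt phrasings and element orderings. The two templates and the function name `build_queries` are hypothetical and are not taken from the paper, whose actual prompt pool and sampling procedure are not specified in this abstract.

```python
from itertools import permutations

# Hypothetical prompt templates (assumption): the paper's actual phrasings
# are not given in the abstract.
TEMPLATES = [
    "Is {x} an element of the set {{{items}}}?",
    "Does the set {{{items}}} contain {x}?",
]


def build_queries(elements, candidate, templates=TEMPLATES):
    """Yield (prompt, expected_answer) for every ordering and template."""
    expected = candidate in elements  # ground truth is order-independent
    for order in permutations(elements):
        items = ", ".join(order)
        for template in templates:
            yield template.format(x=candidate, items=items), expected


# Enumerate all orderings of the example set from the abstract.
queries = list(build_queries(["pear", "plum", "apple", "raspberry"], "apple"))
```

Because the ground-truth answer is the same for every ordering and phrasing, any variation in a model's answers across these queries would reflect sensitivity to the prompt or to the permutation rather than to the task itself.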
