Line Goes Up? Inherent Limitations of Benchmarks for Evaluating Large Language Models

James Fodor

TL;DR

The paper critiques the use of standard benchmarks as proxies for general cognitive competence in large language models, arguing that over-fitting, lack of real-world relevance, and data contamination undermine their validity. It surveys empirical evidence from adversarial stimuli and interpretability studies showing that LLMs struggle to learn robust task structure and to generalise beyond their training data, often relying on superficial heuristics. Its analysis of newer benchmarks finds persistent contamination and limited evidence that benchmark success translates to real-world capabilities, and it argues that claims about 'reasoning models' remain unconvincing because these models appear to rely on heuristics rather than genuine reasoning. The work advocates adversarially informed, interpretability-centred evaluation for assessing LLM generalisation and cautions against equating benchmark gains with genuine cognitive progress.

Abstract

Large language models (LLMs) regularly demonstrate new and impressive performance on a wide range of language, knowledge, and reasoning benchmarks. Such rapid progress has led many commentators to argue that LLM general cognitive capabilities have likewise rapidly improved, with the implication that such models are becoming progressively more capable on various real-world tasks. Here I summarise theoretical and empirical considerations to challenge this narrative. I argue that inherent limitations with the benchmarking paradigm, along with specific limitations of existing benchmarks, render benchmark performance highly unsuitable as a metric for generalisable competence over cognitive tasks. I also contend that alternative methods for assessing LLM capabilities, including adversarial stimuli and interpretability techniques, have shown that LLMs do not have robust competence in many language and reasoning tasks, and often fail to learn representations which facilitate generalisable inferences. I conclude that benchmark performance should not be used as a reliable indicator of general LLM cognitive capabilities.

Paper Structure

This paper contains 10 sections.