Can We Count on LLMs? The Fixed-Effect Fallacy and Claims of GPT-4 Capabilities

Thomas Ball, Shuo Chen, Cormac Herley

TL;DR

The paper examines how LLM capabilities are evaluated. It shows that accuracy on simple, deterministic tasks is highly sensitive to seemingly minor variations in prompt and input, and that generalizing from such measurements commits what the authors call the language-as-fixed-effect fallacy. Using large-scale, parameterized tasks and statistical tests, the authors demonstrate that GPT-4 accuracy estimates for counting, sorting, and arithmetic tasks vary greatly across prompts and input populations, which undermines generalization. The key contributions are a formal demonstration of non-generalizability due to fixed-effect-like factors, open data for replication, and a call for revised margin-of-error notions in LLM evaluation. In practice, the work encourages more robust, transparent evaluation methodologies and cautions against treating single-task performance as evidence of broad capabilities.

Abstract

In this paper we explore evaluation of LLM capabilities. We present measurements of GPT-4 performance on several deterministic tasks; each task involves a basic calculation and takes as input parameter some element drawn from a large well-defined population (e.g., count elements in a list, multiply two k-digit numbers, etc.). We examine several conditions per-task and perform enough trials so that statistically significant differences can be detected. This allows us to investigate the sensitivity of task-accuracy both to query phrasing and input parameter population. We find that seemingly trivial modifications in the task-prompt or input population can yield differences far larger than can be explained by sampling effects. For example, performance on a simple list-counting task varies with query-phrasing and list-length, but also with list composition (i.e., the thing-to-be-counted) and object frequency (e.g., success when an element accounts for approximately 50% of a list is different from when it accounts for approximately 70%, etc.). We conclude that efforts to quantify LLM capabilities easily succumb to the language-as-fixed-effect fallacy, where experimental observations are improperly generalized beyond what the data supports. A consequence appears to be that intuitions that have been formed based on interactions with humans form a very unreliable guide as to which input modifications should "make no difference" to LLM performance.
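To make "differences far larger than can be explained by sampling effects" concrete, here is a minimal sketch of one way to compare accuracy under two conditions: a pooled two-proportion z-test. This is an illustration only. The abstract does not say which statistical test the authors use, the function name `two_proportion_z_test` is hypothetical, and the counts in the usage lines are made up rather than taken from the paper.

```python
import math


def two_proportion_z_test(successes_a, trials_a, successes_b, trials_b):
    """Pooled two-proportion z-test (an assumed choice of test, not the paper's).

    Returns (z, two_sided_p_value) for the null hypothesis that both
    conditions share the same underlying success rate.
    """
    p_a = successes_a / trials_a
    p_b = successes_b / trials_b
    # Under the null hypothesis, both samples estimate one common rate.
    pooled = (successes_a + successes_b) / (trials_a + trials_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / trials_a + 1 / trials_b))
    if std_err == 0:
        # All trials succeeded or all failed in both conditions: no evidence of a difference.
        return 0.0, 1.0
    z = (p_a - p_b) / std_err
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up counts for illustration: 890/1000 correct under one prompt
# phrasing, 810/1000 under another.
z, p = two_proportion_z_test(890, 1000, 810, 1000)
print(f"z = {z:.2f}, p = {p:.2e}")
```

With enough trials per condition, a gap of a few percentage points yields a very small p-value, which is the sense in which a difference exceeds what sampling noise alone would produce. A significant result of this kind supports a claim only about the specific prompt and input population that were tested.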

Paper Structure (13 sections, 2 equations, 9 tables)