Auxiliary task demands mask the capabilities of smaller language models

Jennifer Hu; Michael C. Frank

TL;DR

Task demands confound inferences about latent capacities in evaluations of both humans and LMs. We define and quantify a signature 'demand gap' ($\Delta$) as the difference in performance between low-demand and high-demand evaluation settings. Across analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments, $\Delta$ is largest for smaller models and shrinks as model size or training time increases. The results suggest that performance reflects evaluation design and resource availability rather than a fixed level of intelligence, underscoring the need for theory-driven evaluation choices in AI research and applications.

Abstract

Developmental psychologists have argued about when cognitive capacities such as language understanding or theory of mind emerge. These debates often hinge on the concept of "task demands" -- the auxiliary challenges associated with performing a particular evaluation -- that may mask the child's underlying ability. The same issues arise when measuring the capacities of language models (LMs): performance on a task is a function of the model's underlying knowledge, combined with the model's ability to interpret and perform the task given its available resources. Here, we show that for analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments, evaluation methods with greater task demands yield lower performance than evaluations with reduced demands. This "demand gap" is most pronounced for models with fewer parameters and less training data. Our results illustrate that LM performance should not be interpreted as a direct indication of intelligence (or lack thereof), but as a reflection of capacities seen through the lens of researchers' design choices.

Paper Structure (23 sections, 3 equations, 3 figures, 2 tables)

Introduction
Background and related work
Methods
Evaluation contrasts
Production vs. forced choice.
Metalinguistic judgment vs. probability measurement.
Cognitive domains
Production vs. forced choice
Analogical reasoning.
Reflective reasoning.
Metalinguistic judgment vs. probability measurement
Word prediction.
Grammaticality judgment.
Models
Model size.
...and 8 more sections

Figures (3)

Figure 1: A: Hypothetical task demands in two evaluation settings, faced by humans and machines. Both methods apparently measure the accuracy of target word prediction, but the high demand setting imposes additional auxiliary demands. B: Hypothesized pattern of results if task demands asymmetrically affect less capable agents (e.g., younger children or smaller models). C: Signature "demand gap" produced by hypothesized pattern in B.
Figure 2: Production (high task demands) vs. forced choice (low task demands) in two domains: analogical reasoning (top row) and reflective reasoning (bottom row). A,C: Accuracy scores across models and evaluation methods. B,D: Difference of log odds (forced choice $-$ production). Colored lines = best-fit within model families. Black line = best-fit across all models. Shaded region indicates bootstrapped 95% CI. (Log odds difference for Pythia 1B in panel D is infinite, so it is not shown.)
Figure 3: Metalinguistic judgment (high task demands) vs. direct probability measurement (low task demands) in two domains: word prediction (top row) and grammaticality judgments (bottom row). A: Log probability assigned to final word in word prediction domain. B: Difference of final-word log probability (direct $-$ metalinguistic). C: Accuracy in grammaticality judgment domain. D: Difference of log odds (direct $-$ metalinguistic). Colored lines = best-fit within model families. Black line = best-fit across all models. Shaded region indicates bootstrapped 95% CI.
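
The log-odds differences described in the captions for Figures 2 and 3 can be illustrated with a minimal sketch. It assumes the gap is computed from one accuracy per evaluation setting, with the sign convention of the captions (low demand minus high demand). The function names `log_odds` and `demand_gap` and the accuracy values are hypothetical and are not taken from the paper. The sketch also shows why a gap can be infinite: the log odds diverge whenever an accuracy is exactly 0 or 1. The caption does not state which case applies to Pythia 1B.

```python
import math


def log_odds(accuracy: float) -> float:
    """Log odds (logit) of an accuracy in [0, 1]; infinite at the endpoints."""
    if accuracy <= 0.0:
        return -math.inf
    if accuracy >= 1.0:
        return math.inf
    return math.log(accuracy / (1.0 - accuracy))


def demand_gap(low_demand_acc: float, high_demand_acc: float) -> float:
    """Low-demand minus high-demand log odds (sign convention of the captions).

    Returns NaN if both accuracies sit at the same endpoint (inf - inf).
    """
    return log_odds(low_demand_acc) - log_odds(high_demand_acc)


# Hypothetical accuracies, for illustration only (not results from the paper).
print(demand_gap(0.80, 0.50))  # finite, positive gap
print(demand_gap(0.60, 0.0))   # infinite gap: high-demand accuracy is exactly 0
```

Working in log odds rather than raw accuracy keeps differences comparable near the floor and ceiling of the scale, at the cost of undefined or infinite values at exactly 0 or 1. Such points then have to be dropped from the plot, as the Figure 2 caption notes for one model.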
