Auxiliary task demands mask the capabilities of smaller language models

Jennifer Hu, Michael C. Frank

TL;DR

Task demands confound inferences about latent capacities in evaluations of both humans and LMs. We define and quantify a signature "demand gap" ($\Delta$): the difference in accuracy between high-demand and low-demand evaluation settings. Across analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments, the magnitude of $\Delta$ is larger for smaller models and shrinks with increased model size or longer training. The results suggest that performance reflects evaluation design and resource availability rather than a fixed level of intelligence, underscoring the need for theory-driven evaluation choices in AI research and applications.

Abstract

Developmental psychologists have argued about when cognitive capacities such as language understanding or theory of mind emerge. These debates often hinge on the concept of "task demands" -- the auxiliary challenges associated with performing a particular evaluation -- that may mask the child's underlying ability. The same issues arise when measuring the capacities of language models (LMs): performance on a task is a function of the model's underlying knowledge, combined with the model's ability to interpret and perform the task given its available resources. Here, we show that for analogical reasoning, reflective reasoning, word prediction, and grammaticality judgments, evaluation methods with greater task demands yield lower performance than evaluations with reduced demands. This "demand gap" is most pronounced for models with fewer parameters and less training data. Our results illustrate that LM performance should not be interpreted as a direct indication of intelligence (or lack thereof), but as a reflection of capacities seen through the lens of researchers' design choices.

Paper Structure (23 sections, 3 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: A: Hypothetical task demands in two evaluation settings, faced by humans and machines. Both methods apparently measure the accuracy of target word prediction, but the high demand setting imposes additional auxiliary demands. B: Hypothesized pattern of results if task demands asymmetrically affect less capable agents (e.g., younger children or smaller models). C: Signature "demand gap" produced by hypothesized pattern in B.
  • Figure 2: Production (high task demands) vs. forced choice (low task demands) in two domains: analogical reasoning (top row) and reflective reasoning (bottom row). A,C: Accuracy scores across models and evaluation methods. B,D: Difference of log odds (forced choice $-$ production). Colored lines = best-fit within model families. Black line = best-fit across all models. Shaded region indicates bootstrapped 95% CI. (Log odds difference for Pythia 1B in panel D is infinite, so it is not shown.)
  • Figure 3: Metalinguistic judgment (high task demands) vs. direct probability measurement (low task demands) in two domains: word prediction (top row) and grammaticality judgments (bottom row). A: Log probability assigned to final word in word prediction domain. B: Difference of final-word log probability (direct $-$ metalinguistic). C: Accuracy in grammaticality judgment domain. D: Difference of log odds (direct $-$ metalinguistic). Colored lines = best-fit within model families. Black line = best-fit across all models. Shaded region indicates bootstrapped 95% CI.
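
The "difference of log odds" plotted in Figures 2 and 3 can be illustrated with a minimal sketch. This is not the authors' code. The function names `log_odds` and `demand_gap` are hypothetical, and the sign convention (low-demand minus high-demand) is taken from the captions. The sketch assumes the gap is computed from a single accuracy value per model and setting. Under that assumption, an accuracy of exactly 0 or 1 in one setting gives an infinite difference, which is one possible reason a point could be left off a plot.

```python
import math


def log_odds(accuracy: float) -> float:
    """Return the log odds (logit) of an accuracy in [0, 1].

    Accuracies of exactly 0 or 1 map to -inf and +inf respectively.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must lie in [0, 1]")
    if accuracy == 0.0:
        return -math.inf
    if accuracy == 1.0:
        return math.inf
    return math.log(accuracy / (1.0 - accuracy))


def demand_gap(low_demand_acc: float, high_demand_acc: float) -> float:
    """Hypothetical helper: low-demand minus high-demand log odds.

    Mirrors the sign convention in the captions
    (forced choice - production, direct - metalinguistic).
    """
    return log_odds(low_demand_acc) - log_odds(high_demand_acc)


if __name__ == "__main__":
    # Illustrative accuracies only; these are not results from the paper.
    print(demand_gap(0.8, 0.5))  # positive gap: low-demand setting scores higher
    print(demand_gap(0.6, 0.0))  # infinite gap: high-demand accuracy is exactly 0
```

Working in log odds rather than raw accuracy keeps a gap near ceiling or floor from being compressed, at the cost of undefined or infinite values when an accuracy is exactly 0 or 1.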