On the Measure of a Model: From Intelligence to Generality

Ruchira Dhar, Ninell Oldenburg, Anders Soegaard

TL;DR

This work questions the use of intelligence as the central lens for evaluating AI systems, arguing that intelligence is ill-defined and benchmarks often fail to predict real-world utility. It formalizes three notions—generality, stability, and realism—and shows that only generality yields a robust, transferable evaluation framework, supported by multitask learning theory. The authors introduce the Generality Score (G-Score) to operationalize breadth and consistency across a diverse task set, and demonstrate how evaluating across many tasks reduces generalization error by a factor roughly proportional to $\sqrt{n}$, where $n$ is the number of tasks. The result is a practical reorientation of model assessment toward broad, reliable performance in open-ended environments, with implications for benchmarking, deployment, and future research on task distributions and evaluation design.
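The $\sqrt{n}$ scaling mentioned above can be made concrete with a small simulation. This is only a sketch of the elementary statistical effect, not the paper's theorem or its G-Score: it assumes, purely for illustration, that per-task losses are independent draws from a uniform distribution on $[0, 1]$, and the helper name `spread_of_mean` is hypothetical. Under that assumption, the spread of the average loss over $n$ tasks shrinks roughly as $1/\sqrt{n}$.

```python
import random
import statistics


def spread_of_mean(n_tasks, trials=2000, seed=0):
    """Standard deviation of the average loss over n_tasks sampled tasks.

    Assumption (not from the paper): each task loss is an independent
    draw from Uniform(0, 1).
    """
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        # Draw one loss per task, then average across the n_tasks tasks.
        losses = [rng.random() for _ in range(n_tasks)]
        means.append(sum(losses) / n_tasks)
    return statistics.pstdev(means)


# Quadrupling the number of tasks should roughly halve the spread.
for n in (1, 4, 16, 64):
    print(n, round(spread_of_mean(n), 4))
```

Each fourfold increase in the number of tasks should cut the printed spread roughly in half. Real task distributions are neither uniform nor independent, so this shows only the direction of the effect, not its size in practice.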

Abstract

Benchmarks such as ARC, Raven-inspired tests, and the Blackbird Task are widely used to evaluate the intelligence of large language models (LLMs). Yet the concept of intelligence remains elusive: it lacks a stable definition and fails to predict performance on practical tasks such as question answering, summarization, or coding. Optimizing for such benchmarks risks misaligning evaluation with real-world utility. Our perspective is that evaluation should be grounded in generality rather than abstract notions of intelligence. We identify three assumptions that often underpin intelligence-focused evaluation: generality, stability, and realism. Through conceptual and formal analysis, we show that only generality withstands conceptual and empirical scrutiny. Intelligence is not what enables generality; generality is best understood as a multitask learning problem that directly links evaluation to measurable performance breadth and reliability. This perspective reframes how progress in AI should be assessed and proposes generality as a more stable foundation for evaluating capability across diverse and evolving tasks.

Paper Structure

This paper contains 21 sections, 2 theorems, 23 equations, 2 figures.

Key Result

Theorem 1

Consider an environment $\mathcal{E}$ consisting of a distribution $Q$ over tasks, where each task $P \sim Q$ is a distribution over data in a learning problem. Let $\mathcal{H}$ be a hypothesis class, $L_P(h)$ be the loss on task $P$, and $L_Q(h)$ be the model's environment average error. Then, for

Figures (2)

  • Figure 1: Comparing the performance of LLMs on the AGI benchmarks ARC-AGI-1 and ARC-AGI-2 and on preference-based benchmarks like LMArena.
  • Figure 2: The performance of LLMs on task-specific benchmarks OpenBookQA, Entity Extraction, and StackUnseen.

Theorems & Definitions (7)

  • Definition 3.1: Generality
  • Definition 3.2: Stability
  • Definition 3.3: Realism
  • Theorem 1
  • proof
  • Theorem 2
  • proof