
Brevity Constraints Reverse Performance Hierarchies in Language Models

MD Azizul Hakim

Abstract

Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite having 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate that this reflects a correctable prompt-design issue rather than a fundamental capability limitation. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9 percentage point advantages over small models -- direct inversions of the original gaps. These reversals demonstrate that large models possess superior latent capabilities that universal prompting masks. We validate these findings through three independent contamination tests and show that inverse scaling operates continuously across the full parameter spectrum, with dataset-specific optimal scales ranging from 0.5B to 3.0B parameters. Our results establish that maximizing large-model performance requires scale-aware prompt engineering rather than universal evaluation protocols, with immediate implications for deployment: prompt adaptation simultaneously improves accuracy and reduces computational costs.



Figures (6)

  • Figure 1: Problem-level performance matrix reveals discriminative inefficiency. (A) Problem categorization across five benchmarks. Orange segments highlight inverse scaling problems where small models ($\leq$10B parameters) outperform large models ($\geq$70B parameters). (B) Overall distribution shows 7.7% inverse scaling rate across 1,485 problems with Cohen's $d=1.34$ effect size.
  • Figure 2: Discovery of systematic inverse scaling across benchmarks. (A) Prevalence ranges from 3.9% (MMLU-STEM) to 11.3% (BoolQ), with 115 total inverse problems. (B) Performance gap distribution shows mean 28.4pp advantage for small models ($\leq$10B). (C) Strong negative correlation between model size and accuracy on inverse problems; small models achieve 66.1% vs large models' 41.5%.
  • Figure 3: Causal evidence: brevity constraints eliminate inverse scaling. (A) Performance across three conditions shows large models improve dramatically under brevity constraints (Control: 40.2% → Brief: 66.5%, +26.3pp), reducing the gap by 67% (44.2pp → 14.8pp, $t=7.80$, $p<0.0001$). (B) Gap reduction varies by dataset, with complete reversals in GSM8K and MMLU-STEM, where the brief condition causes large models to outperform small ones. (C) Response length validation confirms the intervention successfully manipulated verbosity (Control: 197 tokens → Brief: 78 tokens, 60% reduction), establishing a causal link between overthinking and performance degradation.
  • Figure 4: Contamination analysis through three independent validation tests. (A) Response diversity heatmap shows 89--100% unique responses across datasets, contradicting memorization, which would produce template-like responses. (B) Length variability measured by the coefficient of variation (CV) ranges from 0.31 to 1.21, with all datasets exceeding the memorization threshold (CV $<$ 0.15) and 3/5 exceeding the natural variation threshold (CV $>$ 0.30). (C) Error pattern classification reveals over-reasoning (verbose incorrect logic) as the dominant failure mode (41--82% of large model failures), inconsistent with the memorization hypothesis, which predicts either correct retrieval or evasive brevity. Convergent evidence across all tests supports genuine capability differences rather than contamination artifacts.
  • Figure 5: Dataset-specific breakdown reveals heterogeneous inverse scaling patterns. For each benchmark: (Left) Problem-level accuracy heatmap comparing small models (top) versus large models (bottom), with inverse scaling problems marked by orange dashed lines. (Middle) Model family performance ranked by accuracy, colored by size (blue: $\leq$10B, red: $\geq$70B). (Right) Response length distributions for normal versus inverse problems. Results show: (1) inverse scaling occurs across all task types, (2) small models consistently outperform large models on inverse problems.
  • ...and 1 more figure
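
As a rough illustration of the problem-level analysis summarized in the captions above, the following Python sketch shows how per-problem inverse-scaling flags (Figures 1-2) and the response-length coefficient of variation (Figure 4B) could be computed. This is not the authors' released code: the data layout and column names are assumptions, and only the size cutoffs ($\leq$10B, $\geq$70B) and the CV reference values (0.15, 0.30) are taken from the captions.

    # Illustrative sketch only; column names and data layout are assumed.
    import numpy as np
    import pandas as pd

    SMALL_MAX_B = 10   # "small" models: <= 10B parameters (cutoff from the captions)
    LARGE_MIN_B = 70   # "large" models: >= 70B parameters

    def flag_inverse_problems(df: pd.DataFrame) -> pd.DataFrame:
        """df: one row per (model, problem) with hypothetical columns
        'params_b' (parameters in billions), 'problem_id', 'correct' (0/1)."""
        small = df[df["params_b"] <= SMALL_MAX_B]
        large = df[df["params_b"] >= LARGE_MIN_B]
        acc_small = small.groupby("problem_id")["correct"].mean()
        acc_large = large.groupby("problem_id")["correct"].mean()
        # Gap in percentage points; positive gap = small models beat large models.
        gap_pp = (acc_small - acc_large) * 100
        out = gap_pp.rename("gap_pp").to_frame()
        out["inverse_scaling"] = out["gap_pp"] > 0
        return out

    def length_cv(token_counts) -> float:
        """Coefficient of variation of response lengths (Figure 4B):
        CV < 0.15 is read as template-like, CV > 0.30 as natural variation."""
        counts = np.asarray(token_counts, dtype=float)
        return float(counts.std() / counts.mean())

Under these assumptions, the 7.7% inverse-scaling rate reported in the abstract corresponds to the fraction of the 1,485 problems for which the flag above is true, and the 28.4pp figure to the mean gap over those flagged problems.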