Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models

Bryan E. Tuck, Rakesh M. Verma

TL;DR

The paper investigates how large language models satisfy hard orthographic constraints during constrained generation, evaluating 28 configurations from three model families on 58 word puzzles. It shows that architectural differences produce substantially larger performance gaps ($2.0$--$2.2\times$ in $F_1$) than parameter scaling within families, suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond scaling. Budget sensitivity is heterogeneous: high-capacity models show notable gains ($\Delta F_1$ from $+0.102$ to $+0.136$), mid-sized models plateau or degrade, and proprietary models maintain efficient budget usage. Calibration against difficulty ratings from 10,000 human solvers per puzzle yields modest alignment ($r = 0.24$--$0.38$) but reveals systematic failures on common words with unusual orthography (e.g., "data", "poop", "loll"), implying over-reliance on distributional plausibility and motivating architectural innovations that explicitly track and verify constraints rather than relying solely on scaling or broader reasoning.

Abstract

Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography ("data", "poop", "loll": 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.
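The abstract reports precision, recall, and F1 without this summary stating how they are computed. As a rough sketch, assuming each puzzle is scored by comparing the set of words a model outputs against the puzzle's accepted word list (the function name `score_puzzle` and the example words are hypothetical, not taken from the paper), the metrics could be computed as follows:

```python
def score_puzzle(predicted, solutions):
    """Return (precision, recall, f1) for one puzzle.

    Assumption: scoring is set-based and case-insensitive; duplicates in the
    model output count once. The paper may aggregate differently.
    """
    predicted_set = {w.lower() for w in predicted}
    solution_set = {w.lower() for w in solutions}
    hits = len(predicted_set & solution_set)

    # Guard against empty sets so the ratios are always defined.
    precision = hits / len(predicted_set) if predicted_set else 0.0
    recall = hits / len(solution_set) if solution_set else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


# Hypothetical example: two of three guesses are valid, two of four answers found.
precision, recall, f1 = score_puzzle(
    ["data", "tall", "xyz"], ["data", "tall", "loll", "poop"]
)
```

Under this definition a model can score low F1 through missed words alone, which matches the abstract's observation that failures on words such as "data" or "loll" appear as misses rather than as invalid outputs.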

Paper Structure

This paper contains 33 sections, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Zero-shot prompt structure. The prompt specifies the seven available letters, marks the mandatory center letter, and enumerates all constraints explicitly. Models receive identical specifications without solution counts, isolating intrinsic constraint-handling from memorization or calibration to expected output lengths.
  • Figure 2: Cross-family performance comparison across thinking budgets. Proprietary models achieve 2.0--2.2$\times$ higher F1 than the largest open-source model, with the gap driven primarily by recall (68% vs. 23%) rather than precision. Budget sensitivity varies dramatically across families.
  • Figure 3: Heterogeneous budget sensitivity across model sizes. Smaller models remain flat or decline with additional budget, while the 14B variant paradoxically degrades. Proprietary systems show consistent improvements.
  • Figure 4: Model-human difficulty calibration using 10,000 solver ratings per puzzle. Left: Performance gradients from easy to hard words vary by model capacity (19$\times$ drop for Qwen-4B vs. 2.5$\times$ for GPT-5-mini). Right: Calibration strength shows modest alignment (r=0.24--0.38), with proprietary models achieving higher correlations.
  • Figure 5: Word length effects on model and human performance. Left: Model recall by word length. Right: Human success declines gently (1.3$\times$ drop) while models show catastrophic degradation (1.5--82$\times$ drops).
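The Figure 1 caption describes puzzles built from seven available letters with one mandatory center letter. A minimal sketch of a checker for such constraints is given below. It rests on assumptions this summary does not state: that letters may be reused, that there is a minimum word length (`min_length=4` here is a guess), and that dictionary membership is checked separately. The function name and the example letter set are hypothetical.

```python
def satisfies_constraints(word, letters, center, min_length=4):
    """Check the character-level constraints for one candidate word.

    Assumptions (not confirmed by the source): letters may repeat, and words
    shorter than min_length are rejected. Dictionary validity is not checked.
    """
    word = word.lower()
    allowed = set(letters.lower())
    return (
        len(word) >= min_length       # long enough
        and center.lower() in word    # uses the mandatory center letter
        and set(word) <= allowed      # uses only the available letters
    )


# Hypothetical puzzle: letters "adtolpm" with center letter "a".
candidates = ["data", "poop", "date", "tad"]
valid = [w for w in candidates if satisfies_constraints(w, "adtolpm", "a")]
```

A check like this is purely orthographic: a word such as "data" passes as long as its letters fit, regardless of how typical its spelling looks, which is the kind of explicit verification the summary contrasts with reliance on distributional plausibility.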