Orthographic Constraint Satisfaction and Human Difficulty Alignment in Large Language Models
Bryan E. Tuck, Rakesh M. Verma
TL;DR
The paper investigates how large language models satisfy hard orthographic constraints during constrained generation, evaluating 28 configurations from three model families on 58 word puzzles. Architectural differences produce substantially larger performance gaps ($2.0$--$2.2\times$ in $F_1$) than parameter scaling within families, suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond scaling. Budget sensitivity is heterogeneous: high-capacity models show notable gains ($\Delta F_1$ from $+0.102$ to $+0.136$), while mid-sized models saturate or degrade, and proprietary models maintain efficient budget usage. Calibration against difficulty ratings from 10,000 human solvers per puzzle yields modest alignment ($r = 0.24$--$0.38$) but reveals systematic failures on common words with unusual orthography (e.g., "data", "poop", "loll"). These failures imply reliance on distributional plausibility and motivate architectural innovations that explicitly track and verify constraints rather than relying solely on scaling or broader reasoning.
Abstract
Large language models must satisfy hard orthographic constraints during controlled text generation, yet systematic cross-architecture evaluation remains limited. We evaluate 28 configurations spanning three model families (Qwen3, Claude Haiku-4.5, GPT-5-mini) on 58 word puzzles requiring character-level constraint satisfaction. Architectural differences produce substantially larger performance gaps (2.0-2.2x, F1=0.761 vs. 0.343) than parameter scaling within families (83% gain from eightfold scaling), suggesting that constraint satisfaction may require specialized architectural features or training objectives beyond standard language model scaling. Thinking budget sensitivity proves heterogeneous: high-capacity models show strong returns (+0.102 to +0.136 F1), while mid-sized variants saturate or degrade. These patterns are inconsistent with uniform compute benefits. Using difficulty ratings from 10,000 human solvers per puzzle, we establish modest but consistent calibration (r=0.24-0.38) across all families, yet identify systematic failures on common words with unusual orthography ("data", "poop", "loll": 86-95% human success, 89-96% model miss rate). These failures reveal over-reliance on distributional plausibility that penalizes orthographically atypical but constraint-valid patterns, suggesting architectural innovations may be required beyond simply scaling parameters or computational budgets.
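To make the reported metric concrete, the following is a minimal sketch of how a per-puzzle F1 score and a character-level constraint check could be computed. The puzzle format is an assumption for illustration only (a set of allowed letters, one required letter, and a minimum word length); the abstract does not specify it. The function names `satisfies_constraints` and `puzzle_f1` are hypothetical and are not taken from the paper.

```python
from typing import Iterable, Set


def satisfies_constraints(word: str, allowed: Set[str], required: str,
                          min_len: int = 4) -> bool:
    """Hypothetical character-level check (assumed puzzle format):
    the word must be long enough, contain the required letter,
    and use only letters from the allowed set."""
    w = word.lower()
    return len(w) >= min_len and required in w and set(w) <= allowed


def puzzle_f1(predicted: Iterable[str], gold: Iterable[str]) -> float:
    """Set-based F1 between a model's proposed words and the valid answers."""
    pred = {w.lower() for w in predicted}
    gold_set = {w.lower() for w in gold}
    true_pos = len(pred & gold_set)
    if true_pos == 0:
        # Covers empty predictions, empty gold sets, and no overlap.
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gold_set)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    allowed = set("adtlop")  # illustrative letter set, not from the paper
    print(satisfies_constraints("data", allowed, required="a"))
    print(puzzle_f1(["data", "plot"], ["data", "poop", "loll", "plot"]))
```

Under this reading, a model that omits orthographically unusual but valid words such as "poop" or "loll" keeps its precision but loses recall, which lowers F1 even when every word it proposes is correct.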
