Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression
Rahul Baxi
TL;DR
The paper introduces the Compression-Decay Comprehension Test (CDCT), which measures constraint compliance (CC) and semantic accuracy (SA) independently as prompt length varies. Evaluating 9 frontier LLMs across 8 concepts and 5 compression levels with a three-judge LLM jury, it finds a near-universal U-curve in CC and statistical orthogonality between CC and SA, indicating that constraint violations are a larger and more objectively measurable failure mode than semantic errors. Architecture matters: reasoning models outperform efficient ones. An RLHF ablation supports the constraint salience hypothesis, since removing "helpfulness" signals substantially improves compliance. The findings identify an instruction-ambiguity zone at medium prompt lengths and offer actionable guidelines for deployment and prompt design that improve instruction-following robustness beyond compression alone.
Abstract
Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' κ=0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than at medium lengths. The two dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.
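As an illustration of the agreement statistic reported above, the following is a minimal sketch of how Fleiss' κ could be computed for a fixed-size jury. It uses the standard textbook formula and is not the authors' evaluation code; the function name `fleiss_kappa`, the `"pass"`/`"fail"` labels, and the toy ratings are assumptions made for the example.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items each rated by the same number of raters.

    ratings:    list of per-item label lists, e.g. [["pass", "pass", "fail"], ...]
    categories: list of all possible labels (hypothetical names here)
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who assigned category j to item i
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean over items of the fraction of agreeing rater pairs
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions
    total = n_items * n_raters
    p_categories = [
        sum(row[j] for row in counts) / total for j in range(len(categories))
    ]
    p_expected = sum(p * p for p in p_categories)

    # Undefined when every rating falls in one category (p_expected == 1);
    # this sketch treats that degenerate case as perfect agreement.
    if p_expected == 1.0:
        return 1.0
    return (p_observed - p_expected) / (1.0 - p_expected)


# Toy data (invented for illustration): three judges, four responses
toy_ratings = [
    ["pass", "pass", "pass"],
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
    ["fail", "fail", "pass"],
]
kappa = fleiss_kappa(toy_ratings, ["pass", "fail"])
```

On the conventional Landis and Koch scale, values above 0.80 are read as "almost perfect" agreement, which is the band the reported κ=0.90 falls into.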
