Separating Constraint Compliance from Semantic Accuracy: A Novel Benchmark for Evaluating Instruction-Following Under Compression

Rahul Baxi

TL;DR

The paper introduces the Compression-Decay Comprehension Test (CDCT), which measures constraint compliance (CC) and semantic accuracy (SA) independently as prompt length varies. Evaluating 9 frontier LLMs across 8 concepts and 5 compression levels with a three-judge LLM jury, it finds a near-universal U-curve in CC and statistical orthogonality between CC and SA, and argues that constraint violations are a stronger and more objectively measurable failure mode than semantic errors. The results show that architecture matters: reasoning models outperform efficient ones. An RLHF ablation provides experimental support for the constraint salience hypothesis and substantially improves compliance. The findings have practical implications for deployment and prompt design: they identify an instruction-ambiguity zone and offer actionable guidelines for improving instruction-following robustness beyond compression alone.

Abstract

Large language models (LLMs) exhibit degraded performance under prompt compression, but the mechanisms remain poorly understood. We introduce the Compression-Decay Comprehension Test (CDCT), a benchmark that independently measures constraint compliance (CC) and semantic accuracy (SA) across compression levels. We evaluate 9 frontier LLMs across 8 concepts using 5 compression levels from extreme (c=0.0, ~2 words) to none (c=1.0, ~135 words). A three-judge LLM jury achieves almost perfect inter-rater agreement on CC (Fleiss' κ=0.90). We observe a universal U-curve pattern in constraint compliance (97.2% prevalence), with violations peaking at medium compression (c=0.5, ~27 words). Counterintuitively, models perform better at extreme compression than at medium lengths. The dimensions are statistically orthogonal (r=0.193, p=0.084), with constraint effects 2.9x larger than semantic effects. Experimental validation via RLHF ablation confirms our constraint salience hypothesis: removing "helpfulness" signals improves CC by 598% on average (71/72 trials, p<0.001), with 79% achieving perfect compliance. This demonstrates that RLHF-trained helpfulness behaviors are the dominant cause of constraint violations at medium compression. Reasoning models outperform efficient models by 27.5% (Cohen's d=0.96). Our findings reveal a fundamental tension between RLHF alignment and instruction-following, providing actionable guidelines for improving deployed systems.
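The inter-rater agreement statistic quoted in the abstract can be illustrated with a minimal sketch of Fleiss' kappa. This is the standard textbook formula, not the authors' evaluation code. It assumes each judge assigns one categorical label per response (the "pass"/"fail" labels are hypothetical; the abstract does not say how CC judgments are encoded), and the function name `fleiss_kappa` and the sample ratings are illustrative only.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings: list of items; each item is a list of labels, one per rater.
    categories: list of all possible labels.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])

    # counts[i][j] = number of raters who gave item i the label categories[j]
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: fraction of agreeing rater pairs, averaged over items
    per_item = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall share of each label
    total = n_items * n_raters
    shares = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_chance = sum(s * s for s in shares)

    if p_chance == 1.0:
        # Every rating is the same label; kappa is undefined, report 1.0 by convention
        return 1.0
    return (p_observed - p_chance) / (1.0 - p_chance)


# Hypothetical example: three judges, four responses, pass/fail labels
sample = [
    ["pass", "pass", "pass"],
    ["fail", "fail", "fail"],
    ["pass", "pass", "fail"],
    ["fail", "fail", "fail"],
]
kappa = fleiss_kappa(sample, ["pass", "fail"])
```

Under the commonly used Landis and Koch interpretation, values above 0.80 are labeled "almost perfect" agreement, which matches the wording the abstract uses for κ=0.90.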

Paper Structure

This paper contains 35 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Universal U-curve pattern in constraint compliance. Each line represents one model evaluated across compression levels. Dashed line shows the mean trajectory with 95% CI. The U-curve is near-universal (97.2% prevalence across 72 experiments).
  • Figure 2: Scatter plot of Constraint Compliance vs. Semantic Accuracy across all 81 experiments. The weak correlation (r=0.193, p=0.084) demonstrates statistical independence of the two dimensions.
  • Figure 3: Constraint compliance trajectories for all models. The U-curve pattern is visible across all architectures, with reasoning models (O3, GPT-5, O4-Mini) showing higher overall CC and smaller dips at c=0.5.
  • Figure 4: Semantic accuracy trajectories for all models. SA improves monotonically with more context (decreasing compression). Unlike CC, SA does not exhibit a U-curve.
  • Figure 5: Model comparison across Constraint Compliance and Semantic Accuracy. Reasoning models cluster in the upper-right (high CC, high SA), while efficient models show more variation. Error bars represent 95% CI.
  • ...and 4 more figures
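
The U-curve prevalence reported in the Figure 1 caption could be computed along the following lines. This is a sketch under a loud assumption: the text shown here does not give the paper's formal criterion for a U-curve, so the code treats a trajectory as U-shaped when its lowest interior CC score is strictly below the scores at both the most compressed and the least compressed level. The function names and the sample trajectories are hypothetical.

```python
def is_u_curve(cc_by_level):
    """Return True if CC dips in the middle of the compression range.

    cc_by_level: CC scores ordered from most compressed (c=0.0) to
    uncompressed (c=1.0). Assumed criterion (not taken from the paper):
    the lowest interior score is strictly below both endpoint scores.
    """
    if len(cc_by_level) < 3:
        return False  # no interior point, so no dip is possible
    interior_min = min(cc_by_level[1:-1])
    return interior_min < cc_by_level[0] and interior_min < cc_by_level[-1]


def u_curve_prevalence(trajectories):
    """Fraction of trajectories that satisfy the assumed U-curve criterion."""
    hits = sum(1 for t in trajectories if is_u_curve(t))
    return hits / len(trajectories)


# Hypothetical trajectories over five compression levels (values are made up)
dipping = [0.90, 0.60, 0.30, 0.70, 0.95]
rising = [0.20, 0.40, 0.60, 0.80, 1.00]
prevalence = u_curve_prevalence([dipping, rising])
```

A stricter criterion, for example requiring the minimum to fall exactly at c=0.5, would give a lower prevalence on the same data, so any reported percentage depends on which definition is used.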