The Perplexity Paradox: Why Code Compresses Better Than Math in LLM Prompts

Warren Johnson

TL;DR

This work asks why code prompts tolerate aggressive compression better than math-focused reasoning prompts in LLMs. It validates a task-dependent compression threshold across code and reasoning benchmarks and empirically confirms a perplexity paradox at the token level. It introduces TAAC, a Task-Aware Adaptive Compression algorithm that uses task type, token density, and a quality predictor to adjust compression dynamically, achieving better cost-quality tradeoffs than fixed-ratio baselines. The study demonstrates cross-benchmark generalization (e.g., MBPP) and causal validation via signature preservation, which reduces the NameError rate from 86.1% to 6.1% and raises pass rate by about 34 percentage points. Per-token perplexity analysis supports the mechanism: high-perplexity syntax tokens are retained while low-perplexity numerical values in math problems are pruned. The practical impact is a scalable way to reduce inference costs in LLM deployments with little loss of accuracy, with broader design implications for prompt compression and task-aware optimization.

Abstract

In "Compress or Route?" (Johnson, 2026), we found that code generation tolerates aggressive prompt compression (r >= 0.6) while chain-of-thought reasoning degrades gradually. That study was limited to HumanEval (164 problems), left the "perplexity paradox" mechanism unvalidated, and provided no adaptive algorithm. This paper addresses all three gaps. First, we validate across four code benchmarks (HumanEval, MBPP, HumanEval+, MultiPL-E) and four reasoning benchmarks (GSM8K, MATH, ARC-Challenge, MMLU-STEM), confirming that the compression threshold generalizes across languages and difficulties. Second, we conduct the first per-token perplexity analysis (n=723 tokens), revealing a "perplexity paradox": code syntax tokens have high perplexity and are preserved, while numerical values in math problems have low perplexity and are pruned despite being task-critical. Signature injection recovers +34 percentage points in pass rate (5.3% to 39.3%; Cohen's h=0.890). Third, we propose TAAC (Task-Aware Adaptive Compression), which achieves a 22% cost reduction with 96% quality preservation, outperforming fixed-ratio compression by about 7 percentage points in quality. MBPP validation (n=1,800 trials) confirms systematic variation with compression ratio: 3.6% at r=0.3 to 54.6% at r=1.0.
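
The effect size quoted for signature injection can be checked from the two pass rates alone. The sketch below assumes the standard arcsine-transform definition of Cohen's h; the function name `cohens_h` is ours, not the paper's. Because the published percentages are rounded, the result should land close to the reported 0.890 rather than match it digit for digit.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h between two proportions, using the arcsine transform."""
    phi1 = 2 * math.asin(math.sqrt(p1))
    phi2 = 2 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Pass rates from the abstract: without and with signature injection.
baseline, injected = 0.053, 0.393

gain_pp = (injected - baseline) * 100  # difference in percentage points
h = cohens_h(injected, baseline)

print(f"gain = {gain_pp:.1f} pp, Cohen's h = {h:.4f}")
```

By Cohen's conventional benchmarks, an h of 0.8 or more counts as a large effect, which is how a gain of this size would normally be read.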

Paper Structure

This paper contains 52 sections, 13 equations, 4 figures, 15 tables, and 1 algorithm.

Figures (4)

  • Figure 1: Quality preservation under compression for Code and Chain-of-Thought (CoT) tasks from length-controlled analysis. Code tasks (blue squares) exhibit threshold behavior: quality remains high ($>0.99$) at $r \geq 0.6$, with a sharp cliff below the threshold. CoT tasks (orange triangles) show steeper degradation at low compression ratios but peak at $r=0.6$ before declining at $r=0.7$. Shaded regions indicate approximate 95% confidence intervals. Percentages show success rates at each compression level. The vertical dashed line marks the optimal compression threshold $r^* = 0.6$, where both task types achieve $\geq 99\%$ quality preservation.
  • Figure 2: Mean perplexity by token category from empirical analysis of 723 tokens (log scale). Python syntax exhibits the highest perplexity ($\mu = 928{,}636$), indicating strong preservation under compression. The 79$\times$ ratio between Python Syntax and Content Words ($\mu = 11{,}697$) demonstrates dramatic category-dependent variation. Notably, Numbers show paradoxically low perplexity ($\mu = 9{,}195$) despite being task-critical for reasoning, which explains why compression algorithms preferentially prune numerical values. Color gradient encodes perplexity magnitude from blue (low) to red (high); sample sizes shown as $n$ values.
  • Figure 3: TAAC (Task-Aware Adaptive Compression) system architecture. Stage 1: A DistilBERT classifier categorizes the input prompt into task types (code, cot, or hybrid). Stage 2: Token-level perplexity analysis estimates information density $\rho$, identifying which tokens are most compressible. Stage 3: Quality-gated compression iteratively compresses the prompt while a quality predictor monitors output quality. The target compression ratio $r^*_t$ is determined by task type: $r^*_{\texttt{code}} = 0.65$ (code tolerates aggressive compression) and $r^*_{\texttt{cot}} = 0.80$ (reasoning requires higher preservation). The feedback loop (dashed) ensures compression stops if predicted quality $\hat{q}$ drops below threshold $\tau$.
  • Figure 4: Pareto frontier comparing TAAC against fixed-ratio compression strategies. Points on the dashed frontier represent Pareto-optimal configurations: no other strategy achieves both higher quality and greater cost savings. TAAC achieves 95.6% quality preservation at 21.8% cost savings, outperforming Fixed $r=0.6$ by +6.5 percentage points in quality while requiring less aggressive compression. Task-Based Fixed and Fixed $r=0.7$ fall within the dominated region, indicating suboptimal cost-quality tradeoffs. The shaded region represents configurations that are strictly dominated by points on the Pareto frontier.
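
The quality-gated loop described in the Figure 3 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the paper's algorithm: the task label (Stage 1) and per-token scores (Stage 2) are taken as inputs rather than produced by a DistilBERT classifier or a language model, `predict_quality` is a hypothetical stand-in for the quality predictor, and the `hybrid` target, the threshold `tau`, and the step size are values chosen here for illustration. Only the `code` and `cot` targets come from the caption.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Target ratios r*_t from the Figure 3 caption.
# The "hybrid" entry is an assumption (the more conservative of the two).
TARGET_RATIO = {"code": 0.65, "cot": 0.80, "hybrid": 0.80}


def keep_top_tokens(tokens: Sequence[str], scores: Sequence[float], r: float) -> List[str]:
    """Keep roughly the r fraction of highest-scoring tokens, in original order."""
    # The small epsilon guards against float noise such as 13.000000000000002.
    n_keep = max(1, math.ceil(r * len(tokens) - 1e-9))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in sorted(ranked[:n_keep])]


def quality_gated_compress(
    tokens: Sequence[str],
    scores: Sequence[float],
    task: str,
    predict_quality: Callable[[Sequence[str], Sequence[str]], float],
    tau: float = 0.95,   # assumed threshold; the source does not give a value
    step: float = 0.05,  # assumed step size
) -> Tuple[List[str], float]:
    """Lower r step by step toward the task target, stopping if quality drops."""
    target = TARGET_RATIO[task]
    best_tokens, best_r = list(tokens), 1.0
    n_steps = round((1.0 - target) / step)
    for k in range(1, n_steps + 1):
        r = round(1.0 - k * step, 6)
        candidate = keep_top_tokens(tokens, scores, r)
        # Feedback gate: stop as soon as predicted quality falls below tau.
        if predict_quality(tokens, candidate) < tau:
            break
        best_tokens, best_r = candidate, r
    return best_tokens, best_r


if __name__ == "__main__":
    # Toy data: 20 placeholder tokens whose score equals their index.
    toy_tokens = [f"t{i}" for i in range(20)]
    toy_scores = [float(i) for i in range(20)]

    # Toy predictor: quality collapses once token "t3" has been pruned.
    def toy_predictor(original: Sequence[str], compressed: Sequence[str]) -> float:
        return 1.0 if "t3" in compressed else 0.0

    kept, ratio = quality_gated_compress(toy_tokens, toy_scores, "code", toy_predictor)
    print(f"stopped at r={ratio} with {len(kept)} of {len(toy_tokens)} tokens")
```

The gate matters most when the pruning score and task importance disagree, which is the situation the perplexity paradox describes: a predictor that notices a missing numerical value can stop compression before the task target is reached.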

Theorems & Definitions (2)

  • Definition 1: The Perplexity Paradox
  • Definition 2: Semantic Necessity Score