Surprisal-Guided Selection: Compute-Optimal Test-Time Strategies for Execution-Grounded Code Generation

Jarrod Barnes

TL;DR

This work asks whether gradient-based test-time training is advantageous for dense, verifiable execution-grounded tasks and finds that, under matched compute, pure search with modest sampling outperforms adaptation. A surprisal-guided selection strategy—picking the highest-surprisal correct sample—achieves near-oracle performance and, with surprisal-guided-top3, exactly matches oracle results, all without extra compute. The authors reveal a failure mode they term over-sharpening, where gradient updates collapse sample diversity toward mediocre solutions and reduce discovery of tail-optimal kernels. They propose a Reward Compression Principle: dense rewards compress quickly (1–2 steps) and additional steps harm performance, suggesting zero-cost strategies are preferable in dense-VEG domains. The results imply that for dense, deterministic evaluation tasks like GPU kernel optimization, allocating compute to sampling diversity and intelligent selection is more effective than further gradient adaptation, with potential generalization to other execution-grounded domains.

Abstract

Test-time training (TTT) adapts language models through gradient-based updates at inference. But is adaptation the right strategy? We study compute-optimal test-time strategies for verifiable execution-grounded (VEG) tasks, domains like GPU kernel optimization where a deterministic evaluator provides dense, continuous reward signals. Using KernelBench as our testbed and a 120B-parameter model (GPT-OSS-120B with LoRA adaptation), we find that search outperforms minimal adaptation (1–5 gradient steps): Best-of-N sampling achieves 90% task success (18/20 tasks) at K=64 across the full KernelBench L1 eval set, while TTT's best checkpoint reaches only 30.6% (3-seed mean), with TTT's "equivalent K" falling below 1, worse than single-sample inference. The failure mode is over-sharpening: gradient updates collapse diversity toward mediocre solutions rather than discovering optimal ones. Our main contribution is surprisal-guided selection: selecting the highest-surprisal (lowest-confidence) correct sample yields 80% success vs. 50% for most-confident selection, a 30-percentage-point improvement. Extending to surprisal-guided-top3 matches oracle performance at 100%. This zero-cost strategy is validated through length-controlled analysis. For dense-reward VEG tasks, compute should be allocated to sample diversity and intelligent selection rather than gradient adaptation. The surprisal-guided selection principle may generalize to other execution-grounded domains where optimal solutions occupy the distribution tail.
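
The selection rule described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the `Sample` record, the function names, and the definition of surprisal as the negative sum of per-token log-probabilities (with an optional length-normalized variant) are assumptions made for illustration. The top-k variant additionally assumes that a measured speedup is available for each shortlisted candidate.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Sample:
    """Hypothetical record for one generated kernel (field names are assumptions)."""
    token_logprobs: Sequence[float]  # per-token log-probs under the sampling policy
    is_correct: bool                 # verdict from the deterministic evaluator
    speedup: float                   # measured speedup over the reference kernel


def surprisal(sample: Sample, length_normalized: bool = False) -> float:
    """Negative log-likelihood of the sample; optionally averaged per token."""
    total = -sum(sample.token_logprobs)
    if length_normalized:
        return total / max(len(sample.token_logprobs), 1)
    return total


def select_surprisal_guided(
    samples: List[Sample], top_k: int = 1, length_normalized: bool = False
) -> Optional[Sample]:
    """Rank correct samples by descending surprisal, shortlist the first
    top_k, and return the shortlisted sample with the best speedup.
    With top_k=1 this is plain highest-surprisal selection."""
    correct = [s for s in samples if s.is_correct]
    if not correct:
        return None
    ranked = sorted(
        correct, key=lambda s: surprisal(s, length_normalized), reverse=True
    )
    return max(ranked[:top_k], key=lambda s: s.speedup)


def select_most_confident(samples: List[Sample]) -> Optional[Sample]:
    """Baseline: the correct sample with the lowest surprisal."""
    correct = [s for s in samples if s.is_correct]
    if not correct:
        return None
    return min(correct, key=surprisal)
```

Incorrect samples are filtered out before ranking, so a high-surprisal but failing kernel is never chosen. The `length_normalized` flag is included only because the abstract mentions a length-controlled analysis; whether the paper normalizes by length in the selection rule itself is not stated here.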

Paper Structure (40 sections, 1 equation, 7 figures, 12 tables, 1 algorithm)

Figures (7)

  • Figure 1: Test-time strategy comparison. Best-of-N scaling (gray) saturates at $K\!=\!16$. At $K\!=\!64$, TTT (31%, red) is 2$\times$ worse than random selection (59%); surprisal-guided (blue) matches oracle. The +30% bracket: confidence (50%) vs. surprisal (80%).
  • Figure 2: Dual-loop architecture. The outer loop (blue) trains a base policy via reinforcement learning from verifiable rewards (RLVR) on 80 KernelBench tasks. The inner loop (green) compares test-time strategies under matched compute. The selection mechanism (dashed box) determines how to choose among correct samples. Both loops share the same evaluator (orange).
  • Figure 3: Best-of-N scaling curve. Performance saturates at $K\!=\!16$ (99.9%). TTT BoA at 30.6% falls below $K\!=\!1$ random sampling (53.3%).
  • Figure 4: Selection strategy comparison. (a) fast_1 success rate and (b) mean speedup. Surprisal-guided achieves 80% vs. 50% for confidence-guided (+30%). Surprisal-guided-top3 matches oracle.
  • Figure 5: Adaptation trajectory. Performance peaks at 1–2 steps then regresses. Stars mark BoA-selected checkpoints.
  • ...and 2 more figures
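
A Best-of-N scaling curve like the one in Figure 3 can be estimated from a fixed pool of samples per task. The sketch below uses the standard unbiased "at least one success in K draws" estimator, 1 - C(n-c, K) / C(n, K), where n is the pool size and c the number of successful samples. Whether the paper computes its curve this way is an assumption, and the function name `success_at_k` is hypothetical.

```python
from math import comb


def success_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples, drawn without replacement
    from a pool of n samples containing c successes, is a success."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer failures than draws: a success is guaranteed.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def scaling_curve(pools: list, ks: list) -> dict:
    """Average success_at_k over tasks; pools is a list of (n, c) pairs."""
    return {
        k: sum(success_at_k(n, c, k) for n, c in pools) / len(pools)
        for k in ks
    }
```

For example, `scaling_curve([(64, 10), (64, 0)], [1, 4, 16])` averages the per-task estimates over two hypothetical tasks, one of which has no successful sample and therefore contributes 0 at every K.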