Learning Adaptive LLM Decoding

Chloe H. Su, Zhe Ye, Samuel Tenka, Aidan Yang, Soonho Kong, Udaya Ghai

TL;DR

This work introduces lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards to learn adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources.

Abstract

Decoding from large language models (LLMs) typically relies on fixed sampling hyperparameters (e.g., temperature, top-p), despite substantial variation in task difficulty and uncertainty across prompts and individual decoding steps. We propose to learn adaptive decoding policies that dynamically select sampling strategies at inference time, conditioned on available compute resources. Rather than fine-tuning the language model itself, we introduce lightweight decoding adapters trained with reinforcement learning and verifiable terminal rewards (e.g., correctness on math and coding tasks). At the sequence level, we frame decoding as a contextual bandit problem: a policy selects a decoding strategy (e.g., greedy, top-k, min-p) for each prompt, conditioned on the prompt embedding and a parallel sampling budget. At the token level, we model decoding as a partially observable Markov decision process (POMDP), where a policy selects sampling actions at each token step based on internal model features and the remaining token budget. Experiments on the MATH and CodeContests benchmarks show that the learned adapters improve the accuracy-budget tradeoff: on MATH, the token-level adapter improves Pass@1 accuracy by up to 10.2% over the best static baseline under a fixed token budget, while the sequence-level adapter yields 2-3% gains under fixed parallel sampling. Ablation analyses support the contribution of both sequence- and token-level adaptation.
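
To make the sequence-level framing concrete, here is a minimal sketch of a contextual bandit trained with a REINFORCE-style update and a binary terminal reward. It is an illustration under stated assumptions, not the authors' implementation: the linear softmax policy, the `BanditAdapter` class, the two-dimensional stand-in for a prompt embedding, the normalized budget feature, and the toy success probabilities are all hypothetical. Only the strategy names (greedy, top-k, min-p) come from the abstract.

```python
import math
import random

# Strategy names come from the abstract; their hyperparameters are omitted here.
STRATEGIES = ["greedy", "top-k", "min-p"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


class BanditAdapter:
    """Hypothetical linear softmax policy over decoding strategies."""

    def __init__(self, context_dim, n_actions, lr=0.1, seed=0):
        self.weights = [[0.0] * context_dim for _ in range(n_actions)]
        self.lr = lr
        self.baseline = 0.0  # running mean reward, used to reduce variance
        self.rng = random.Random(seed)

    def probs(self, context):
        logits = [sum(w * c for w, c in zip(row, context)) for row in self.weights]
        return softmax(logits)

    def act(self, context):
        p = self.probs(context)
        return self.rng.choices(range(len(p)), weights=p)[0]

    def update(self, context, action, reward):
        # REINFORCE: d log pi(a|x) / d logit_k = 1[k == a] - pi(k|x)
        p = self.probs(context)
        advantage = reward - self.baseline
        for k, row in enumerate(self.weights):
            indicator = 1.0 if k == action else 0.0
            for i, c in enumerate(context):
                row[i] += self.lr * advantage * (indicator - p[k]) * c
        self.baseline += 0.05 * (reward - self.baseline)


# Toy environment (assumed): success probability of each strategy per prompt type.
SUCCESS = {"easy": [0.9, 0.3, 0.3], "hard": [0.1, 0.4, 0.8]}
# Context = stand-in prompt embedding (2 dims) + normalized sampling budget (1 dim).
CONTEXT = {"easy": [1.0, 0.0, 0.5], "hard": [0.0, 1.0, 0.5]}


def train(steps=6000, seed=0):
    env_rng = random.Random(seed + 1)
    adapter = BanditAdapter(context_dim=3, n_actions=len(STRATEGIES), seed=seed)
    for _ in range(steps):
        kind = env_rng.choice(["easy", "hard"])
        action = adapter.act(CONTEXT[kind])
        # Verifiable terminal reward: 1 if the (simulated) answer is correct.
        reward = 1.0 if env_rng.random() < SUCCESS[kind][action] else 0.0
        adapter.update(CONTEXT[kind], action, reward)
    return adapter


adapter = train()
for kind in ("easy", "hard"):
    p = adapter.probs(CONTEXT[kind])
    print(kind, STRATEGIES[p.index(max(p))], [round(x, 2) for x in p])
```

In this toy setting the policy should come to prefer a different strategy for each prompt type, which is the behavior a static decoding configuration cannot express. A real adapter would replace the simulated reward with an actual correctness check on the generated output.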

Paper Structure (72 sections, 13 equations, 8 figures, 15 tables, 2 algorithms)


Figures (8)

  • Figure 1: Overview of the proposed decoding adapter (DA) for a frozen language model (LM). Blue blocks denote input tokens $x_i$; green blocks denote generated tokens $y_i$.
  • Figure 2: Action distributions and validation reward on MATH for sequence-level adapter strategies. Actions are defined in Table \ref{tab:math}. The x-axis denotes training progress; the left y-axis reports action probability; the right y-axis reports validation reward.
  • Figure 3: Entropy modulation under token-level control. Token entropy distributions for a representative static strategy versus the learned token-level adapter, using the temperature-only action set from Table \ref{tab:math-token}. Action 0 is greedy; larger action indices correspond to higher temperature. Qualitatively, the adapter more often preserves stochasticity on higher-entropy tokens, while collapsing many low-entropy tokens to near-deterministic behavior.
  • Figure 4: Generation length distributions. Generation length statistics for a representative static strategy and the learned adapters. Length shifts are present but not extreme, suggesting the token-level gains are not explained solely by systematic truncation/verbosity changes.
  • Figure 5: Word clouds by action. Most frequent tokens under each temperature-based action from Table \ref{tab:math-token}. This is a lightweight diagnostic and does not by itself explain the policy's performance improvements.
  • ...and 3 more figures
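
The temperature-only action set described in the Figure 3 caption (action 0 is greedy, larger indices mean higher temperature) can be sketched as follows. This is an illustrative sketch only: the temperature grid `TEMPERATURES`, the helper names, and the example logits are assumptions, since the actual action values are defined in a table not reproduced here.

```python
import math
import random

# Hypothetical temperature grid; index 0 is greedy, as in the Figure 3 caption.
TEMPERATURES = [0.0, 0.5, 1.0, 1.5]


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; requires temperature > 0."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def sample_token(logits, action, rng):
    """Pick a token index under the temperature selected by `action`."""
    temperature = TEMPERATURES[action]
    if temperature == 0.0:
        # Greedy decoding: take the highest-logit token deterministically.
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs)[0]


rng = random.Random(0)
logits = [2.0, 1.0, 0.5, -1.0]  # made-up next-token logits
for action, temperature in enumerate(TEMPERATURES):
    token = sample_token(logits, action, rng)
    if temperature > 0.0:
        h = entropy(softmax(logits, temperature))
        print(f"action={action} T={temperature} entropy={h:.3f} token={token}")
    else:
        print(f"action={action} greedy token={token}")
```

Raising the temperature increases the entropy of the sampling distribution for any non-uniform logits, so a per-token action choice gives the policy direct control over how much stochasticity each step retains.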