Adaptive Decoding via Latent Preference Optimization

Shehzaad Dhuliawala, Ilia Kulikov, Ping Yu, Asli Celikyilmaz, Jason Weston, Sainbayar Sukhbaatar, Jack Lanchantin

TL;DR

The paper addresses decoding reliability and diversity by introducing AdaptiveDecoder, a learnable layer that dynamically selects decoding temperatures at token or sequence level. Trained with Latent Preference Optimization (LPO), the approach treats temperature selection as a discrete latent variable and uses preference-based signals to optimize it, outperforming all fixed-temperature baselines across math reasoning, creative writing, and instruction-following tasks. Key contributions include a practical integration of AdaptiveDecoder with frozen LLMs, a general LPO framework for discrete latent decisions, and extensive demonstrations across diverse tasks showing improved performance, reduced repetition, and better constraint adherence. The work suggests broad applicability to other decoding hyperparameters and offers a scalable path toward task-aware, adaptive generation in large language models.

Abstract

During language model decoding, higher-temperature sampling is known to give more creative responses, while lower temperatures give more factually accurate ones. However, such models are commonly applied to general instruction following, which involves both creative and fact-seeking tasks, using a single fixed temperature across all examples and tokens. In this work, we introduce Adaptive Decoding, a layer added to the model to select the sampling temperature dynamically at inference time, at either the token or example level, in order to optimize performance. To learn its parameters we introduce Latent Preference Optimization (LPO), a general approach for training discrete latent variables such as choices of temperature. Our method outperforms all fixed decoding temperatures across a range of tasks that require different temperatures, including UltraFeedback, Creative Story Writing, and GSM8K.
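The idea of selecting a temperature per token can be illustrated with a minimal sketch. It assumes a small linear head over the last hidden state that scores a discrete set of temperatures and picks the highest-scoring one, which is then used to sample the next token. The names `TEMPERATURES`, `select_temperature`, and `sample_token`, the candidate temperature values, and the plain-list representation of weights are all hypothetical choices for illustration, not the paper's implementation.

```python
import math
import random

# Hypothetical set of candidate temperatures (an assumption for this sketch).
TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_temperature(hidden, head_weights):
    """Score each candidate temperature with a linear head and pick the argmax.

    hidden:       last hidden state, a list of floats
    head_weights: one weight row per candidate temperature
    """
    scores = [sum(w * h for w, h in zip(row, hidden)) for row in head_weights]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return TEMPERATURES[best], probs


def sample_token(logits, temperature, rng):
    """Sample a token index from logits at the given temperature.

    A temperature of 0 is treated as greedy decoding (argmax).
    """
    if temperature == 0.0:
        return max(range(len(logits)), key=logits.__getitem__)
    probs = softmax([l / temperature for l in logits])
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]


# Usage: pick a temperature from the hidden state, then sample with it.
hidden = [1.0, 0.0]
head_weights = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                [0.9, 0.0], [0.3, 0.0], [0.1, 0.0]]
tau, _ = select_temperature(hidden, head_weights)
token = sample_token([0.1, 2.0, 0.3], tau, random.Random(0))
```

In a sequence-level variant the same selection would run once per prompt and the chosen temperature would be reused for every token of the response.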

Paper Structure

This paper contains 25 sections, 13 equations, 7 figures, 13 tables.

Figures (7)

  • Figure 1: The $\textsc{AdaptiveDecoder}$. This learned module is added to the standard transformer in order to select decoding hyperparameters. It consists of a new decoder head attached to the last hidden state which assigns probabilities to different hyperparameter choices per token (right) or sequence (left), and the highest probability choice is selected in each case. This allows the LLM to select low temperatures for tokens requiring factual consistency, and higher temperatures for tasks requiring creativity and diversity. For the token level adaptive decoder, a different temperature can be selected for different parts of the response given a single instruction.
  • Figure 2: Latent Preference Optimization (LPO) Training Mechanism. We demonstrate how preference pairs are constructed for training the LPO loss (we show a Sequence-Level $\textsc{AdaptiveDecoder}$, but the procedure remains the same for Token-Level). Here we have $N=2$ generated response samples for a single prompt, and the Reward Model (RM) scores Response$_1$ (sampled with $\tau=0.6$) higher than Response$_2$ (sampled with $\tau=0.2$). Therefore, we use $\tau=0.6$ as the chosen temperature and $\tau=0.2$ as the rejected temperature, and then apply the loss to prefer the chosen temperature over the rejected one for the given context (prompt).
  • Figure 3: UltraMathStories Results. UltraMathStories is a superset of UltraFeedback, GSM8K, and Stories. The Adaptive Decoding models are trained on all 3 subtasks simultaneously. Winrates are shown as the average winrate across the test sets of the 3 subtasks in UltraMathStories. (left) $\textsc{AdaptiveDecoder}_{seq}$ vs Fixed Temperature Winrates. (right) $\textsc{AdaptiveDecoder}_{tok}$ vs Fixed Temperature Winrates. In both cases, Adaptive Decoding outperforms all fixed temperatures.
  • Figure 4: $\textsc{AdaptiveDecoder}_{seq}$ predicted temperature distributions. We show the distribution of predicted temperatures on the test set of each subtask in UltraMathStories. As expected, the model predicts low temperatures for GSM8K, high temperatures for Stories, and temperatures mostly in between for UltraFeedback.
  • Figure 5: Constrained Creative Writing (ConstrainedStories) Results. Here we show a quantitative analysis of the $\textsc{AdaptiveDecoder}$ on the constrained creative writing task, ConstrainedStories. (left) $\textsc{AdaptiveDecoder}_{tok}$ winrates vs fixed temperatures. The high fixed temperatures perform worse because they fail to follow the constraint. Fixed greedy decoding works well at following the constraint, but $\textsc{AdaptiveDecoder}_{tok}$ outperforms it by using higher temperatures when possible. (right) Mean temperature predicted by the $\textsc{AdaptiveDecoder}_{tok}$ for the first 50 tokens of each sentence. This plot confirms our hypothesis that the first token of each sentence should be sampled at a low temperature in order to follow the constraint, and all other tokens at a high temperature in order to write a good story. The average temperature for the first token is $\tau=0.21$, and the average temperature for all other tokens is $\tau=0.55$, showing greedier decoding for the constraint and less greedy decoding everywhere else.
  • ...and 2 more figures
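The preference-pair construction described in the Figure 2 caption can be sketched as follows. Given several responses sampled at different temperatures and scored by a reward model, the temperature of the best-scored response becomes the chosen one and the temperature of the worst-scored response the rejected one. The loss shown is a simple DPO-style logistic loss on the head's log-probabilities, without a reference-model term; this form, the `beta` value, and all function names are assumptions for illustration, since the exact LPO loss is not given in the text above.

```python
import math


def build_preference_pair(samples):
    """Return (chosen_temperature, rejected_temperature).

    samples: list of (temperature, reward) tuples for one prompt.
    If all rewards tie, chosen and rejected coincide and the pair
    carries no training signal; a real pipeline would skip it.
    """
    best = max(samples, key=lambda s: s[1])
    worst = min(samples, key=lambda s: s[1])
    return best[0], worst[0]


def log_softmax(scores):
    """Log-probabilities over the candidate temperatures."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]


def preference_loss(logp_chosen, logp_rejected, beta=0.1):
    """Assumed DPO-style loss: -log sigmoid(beta * (logp_chosen - logp_rejected))."""
    margin = beta * (logp_chosen - logp_rejected)
    # Stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Usage, mirroring the caption: two samples, the tau=0.6 response scores higher.
temperatures = [0.2, 0.6]
chosen, rejected = build_preference_pair([(0.6, 0.9), (0.2, 0.3)])
logp = log_softmax([0.5, -0.5])  # hypothetical head scores for [0.2, 0.6]
loss = preference_loss(logp[temperatures.index(chosen)],
                       logp[temperatures.index(rejected)])
```

Minimizing this loss raises the head's probability of the chosen temperature relative to the rejected one for the given prompt; for a token-level head the same comparison would be applied at each token position.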