Reward-Augmented Decoding: Efficient Controlled Text Generation With a Unidirectional Reward Model

Haikang Deng, Colin Raffel

TL;DR

Controlling the text that large language models generate, without expensive retraining, is a key challenge. Reward-Augmented Decoding (RAD) uses a unidirectional reward model to score candidate continuations and biases next-token sampling with a softmax reweighting over the top-$k$ candidate tokens, caching reward-model activations for efficiency. In the reported experiments, RAD outperforms prior weighted decoding methods and matches state-of-the-art retraining approaches on detoxification and sentiment control, adds minimal overhead when the reward model is small relative to the base LM, and scales to models as large as LLaMA-65B. Because it changes only the decoding procedure, RAD offers a practical, modular way to control very large language models for safety and attribute-oriented generation, with potential extensions to broader objectives and instruction following.

Abstract

While large language models have proven effective in a huge range of downstream applications, they often generate text that is problematic or lacks a desired attribute. In this paper, we introduce Reward-Augmented Decoding (RAD), a text generation procedure that uses a small unidirectional reward model to encourage a language model to generate text that has certain properties. Specifically, RAD uses the reward model to score generations as they are produced and rescales sampling probabilities to favor high-reward tokens. By using a unidirectional reward model, RAD can cache activations from prior generation steps to decrease computational overhead. Through experiments on generating non-toxic and sentiment-controlled text, we demonstrate that RAD performs best among methods that change only the generation procedure and matches the performance of state-of-the-art methods that involve re-training the language model. We further validate that RAD is effective on very large language models while incurring a minimal computational overhead.
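
The rescaling step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes the adjustment takes the form of adding a scaled reward to each top-$k$ logit before the softmax, and the names `rad_reweight`, `sample_token`, `reward_fn`, and `beta` are hypothetical. For brevity the toy reward is looked up per token id, whereas the reward model in RAD scores the full candidate sequence.

```python
import math
import random


def rad_reweight(logits, reward_fn, k=3, beta=5.0):
    """Return {token_id: probability} over the top-k tokens after reweighting.

    logits:    base-LM next-token logits (list index = token id).
    reward_fn: hypothetical callable giving a reward in [0, 1] for a candidate
               token id (a stand-in for a reward model scoring prefix + token).
    beta:      assumed steering strength; 0 recovers plain top-k sampling.
    """
    # Keep only the k candidates with the highest base-LM logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Assumption: shift each candidate's logit by beta times its reward.
    adjusted = [logits[i] + beta * reward_fn(i) for i in top_ids]
    # Numerically stable softmax over the adjusted top-k logits.
    peak = max(adjusted)
    exps = [math.exp(a - peak) for a in adjusted]
    total = sum(exps)
    return {i: e / total for i, e in zip(top_ids, exps)}


def sample_token(probs, rng=random):
    """Draw one token id from the reweighted distribution."""
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


# Toy vocabulary of four tokens with made-up logits and rewards.
logits = [2.0, 1.5, 0.5, -1.0]
rewards = {0: 0.1, 1: 0.9, 2: 0.5, 3: 1.0}

probs = rad_reweight(logits, rewards.get, k=3, beta=5.0)
next_token = sample_token(probs)
```

In this toy setup token 1 becomes the most likely choice even though token 0 has the highest base logit, while token 3 is never considered because it falls outside the top-$k$ set regardless of its reward.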

Paper Structure (32 sections, 11 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: Reward-Augmented Decoding (RAD). RAD steers a language model towards generating text that is assigned a high reward by an auxiliary reward model. Blue/red boxes in the reward model correspond to cached/newly computed hidden states.
  • Figure 2: RAD outperforms all weighted decoding methods (round points $\bullet$ in the graph) and matches methods that involve additional training.
  • Figure 3: RAD achieves the highest positive rate for negative prompts and outperforms all weighted decoding methods.
  • Figure 4: Visualization of RAD's decoding process. Each row represents a single decoding step, where the area is the estimated reward distribution of the top-$50$ candidate sequences, and the red line indicates the selected token's reward score.