Alignment-Aware Decoding

Frédéric Berdoz, Luca A. Lanzendörfer, René Caky, Roger Wattenhofer

TL;DR

Alignment-Aware Decoding (AAD) targets human-preference alignment for LLMs by performing inference-time, token-level reward optimization without retraining. It treats the DPO-aligned model as a token reward function, using the log-ratio with the reference SFT model and applying a plausibility filter to avoid over-optimization, with the next token selected to maximize the reward signal within a restricted set. AAD requires only the pre-DPO reference and the post-DPO aligned models and consistently improves alignment across benchmarks and model scales; it can also generate high-quality synthetic data to boost alignment under data scarcity via iterative DPO. Empirically, AAD outperforms strong baselines, remains robust to data scarcity, and benefits from entropy-aware beam search and iterative data augmentation, offering a practical, training-light path to better aligned LLM deployments.
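The selection rule described above can be sketched for a single decoding step. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `aad_next_token`, the parameter `alpha`, and the specific form of the plausibility filter (keep tokens whose aligned-model probability is at least `alpha` times that of the most likely token) are hypothetical choices made for the example. The summary only states that some plausibility filter restricts the candidate set.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def aad_next_token(aligned_logits, ref_logits, alpha=0.1):
    """Pick the next token id for one decoding step.

    aligned_logits: next-token logits from the post-DPO (aligned) model.
    ref_logits:     next-token logits from the pre-DPO (reference SFT) model.
    alpha:          ASSUMED plausibility threshold in (0, 1]; not from the source.
    """
    logp_aligned = log_softmax(aligned_logits)
    logp_ref = log_softmax(ref_logits)

    # ASSUMED filter: keep tokens whose aligned probability is at least
    # alpha times the probability of the most likely aligned token.
    cutoff = max(logp_aligned) + math.log(alpha)
    plausible = [i for i, lp in enumerate(logp_aligned) if lp >= cutoff]

    # Token-level reward signal: log-ratio of aligned to reference probability.
    # Choose the plausible token that maximizes it.
    return max(plausible, key=lambda i: logp_aligned[i] - logp_ref[i])


# Toy vocabulary of three tokens.
aligned = [2.0, 1.5, -3.0]
reference = [2.5, 0.5, -6.0]
print(aad_next_token(aligned, reference, alpha=0.1))   # filtered reward argmax
print(aad_next_token(aligned, reference, alpha=1e-9))  # effectively no filter
print(aad_next_token(aligned, reference, alpha=1.0))   # reduces to greedy
```

In this toy example the third token has the largest log-ratio but a very low probability under the aligned model, so an unfiltered reward argmax would select it. This is the kind of over-optimization the plausibility filter is meant to prevent. Setting `alpha=1.0` leaves only the most likely token in the candidate set, so the rule falls back to greedy decoding with the aligned model.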

Abstract

Alignment of large language models remains a central challenge in natural language processing. Preference optimization has emerged as a popular and effective method for improving alignment, typically through training-time or prompt-based interventions. In this paper, we introduce alignment-aware decoding (AAD), a method to enhance model alignment directly at inference. Theoretically, AAD can be interpreted as implicit reward optimization, yet it requires no specialized training beyond the standard DPO setup. Empirically, AAD consistently outperforms strong baselines across diverse alignment benchmarks and model scales. Moreover, in data-constrained settings, AAD can produce high-quality synthetic data to improve alignment under standard decoding, providing a practical solution when labeled data is limited.

Paper Structure

This paper contains 36 sections, 9 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Qualitative comparison of AAD against other decoding strategies. Greedy continuations are generated by feeding the prompt together with the current AAD prefix back into the model and greedily selecting the next token, revealing where the greedy trajectory diverges from AAD. AAD identifies the Chihuahua as the smallest recognized breed of dog, making the distinction that it refers to an officially recognized classification, whereas the other strategies simply state "breed" without that nuance. AAD is also the only method that directly addresses size (the core of the prompt) by describing height and body proportions, while greedy and best-of-2 focus mainly on weight. This highlights AAD’s advantage in preserving relevance to the prompt.
  • Figure 2: AAD versus Bo$N$. We evaluate AAD against three selection strategies on the Argilla and Skywork datasets for different values of $N$: (i) Bo$N$ using the oracle, (ii) Bo$N$ using the picker, and (iii) random selection among $N$ completions. AAD remains competitive even against Bo$N$ with the oracle reward model, a setting that is by design unfavorable to AAD, since the oracle is used both for Bo$N$ selection and evaluation, whereas AAD only uses a model aligned on 10% of the data. On Skywork, Bo$N$ reaches the performance of AAD for $N=4$ but requires roughly twice as much compute. On Argilla, even $N=50$ fails to match AAD’s performance. The vertical dashed line indicates the point at which the computational cost of Bo$N$ matches that of our method. For the random selection baseline, we report only the mean performance across all test runs.
  • Figure 2: AAD win rate on AlpacaEval (using default evaluator) across models aligned on Skywork and Nectar. AAD consistently matches or outperforms baselines.
  • Figure 3: Performance of AAD across different training dataset sizes on the Skywork dataset. Results show that AAD consistently outperforms best-of-2 at every data scale, providing clear evidence of its robustness in low-data regimes.
  • Figure 4: Relative alignment loss of the oracle score $R$ on the Argilla dataset as a function of the DPO regularization parameter $\beta$, with baseline performance established at $\beta = 0.05$. As expected, across all strategies, larger $\beta$ values reduce alignment, but AAD consistently shows the lowest relative loss, demonstrating greater hyperparameter robustness compared to baselines. This behavior stems from the fact that $r^*$ is $\beta$-independent, whereas $\pi^*$ is not, as seen in the background section.
  • ...and 7 more figures
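
The remark in the Figure 4 caption that $r^*$ is $\beta$-independent while $\pi^*$ is not can be unpacked with the standard closed-form solution of the KL-regularized objective that underlies DPO. The derivation below is a sketch that assumes the paper's background section uses this standard setup. The subscript in $\pi^*_\beta$ and the partition function $Z_\beta(x)$ are notation introduced here to make the $\beta$-dependence explicit, and the identity is stated at the sequence level rather than the token level.

```latex
% Optimal policy of the KL-regularized objective (standard DPO result):
\pi^*_\beta(y \mid x)
  = \frac{1}{Z_\beta(x)}\,
    \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\left(\frac{r^*(x, y)}{\beta}\right),
\qquad
Z_\beta(x) = \sum_{y} \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\left(\frac{r^*(x, y)}{\beta}\right).

% Rearranging for the reward:
\beta \log \frac{\pi^*_\beta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
  = r^*(x, y) - \beta \log Z_\beta(x).
```

The term $\beta \log Z_\beta(x)$ depends only on the prompt $x$, so for a fixed prompt the scaled log-ratio orders completions exactly as $r^*$ does, whatever the value of $\beta$. Sampling or greedy decoding from $\pi^*_\beta$, by contrast, follows a distribution that moves toward $\pi_{\mathrm{ref}}$ as $\beta$ grows, which is consistent with the reduced alignment at larger $\beta$ reported in the caption.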