On-the-fly Preference Alignment via Principle-Guided Decoding

Mingye Zhu, Yi Liu, Lei Zhang, Junbo Guo, Zhendong Mao

TL;DR

OPAD (On-the-fly Preference Alignment via Principle-Guided Decoding) is a tuning-free framework that aligns model outputs with human principles during inference by maximizing a principle-guided reward under a KL constraint between the constrained and unconstrained policies. The reward is a token-level, sequential KL divergence term that captures residual alignment, enabling a per-token update: $p_{\theta}(\mathbf{y}_t|\mathbf{x},c,\mathbf{y}_{<t}) = \frac{1}{Z} \pi_{\theta}(\mathbf{y}_t|\mathbf{x},c,\mathbf{y}_{<t}) \exp( r_{\pi_{\theta}}(\mathbf{x},\mathbf{y}_{<t},c)/\beta )$, where $Z$ is a tractable partition function. Empirical results across general and personalized alignment tasks show that OPAD is competitive with or superior to RLHF- and decoding-based baselines. It also produces a larger token-level distribution shift, which reflects stronger principle adherence and can be adjusted via $\beta$. The approach offers a scalable, decoding-time method to tailor outputs to diverse principles without fine-tuning, though it relies on KL-based rewards and may risk overfitting or rigidity in ambiguous contexts.
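The per-token update in the TL;DR can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function name `principle_guided_step` is hypothetical, and the reward is assumed to be the per-token log-ratio between the next-token distribution conditioned on the principle and the one without it, which is one simple way to realize "the discrepancy between the constrained policy and its unconstrained counterpart". The logits are toy values rather than model outputs.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    return shifted - np.log(np.sum(np.exp(shifted)))


def principle_guided_step(logits_with_principle, logits_without_principle, beta=1.0):
    """Reweight the next-token distribution as p ∝ pi_c * exp(r / beta).

    ASSUMPTION (not taken from the source): the reward r for each candidate
    token is log pi_c(token) - log pi(token), where pi_c is conditioned on
    the principle and pi is not.
    """
    log_pi_c = log_softmax(np.asarray(logits_with_principle, dtype=float))
    log_pi = log_softmax(np.asarray(logits_without_principle, dtype=float))
    reward = log_pi_c - log_pi            # assumed per-token reward
    unnormalized = log_pi_c + reward / beta
    # Normalizing over the vocabulary plays the role of dividing by Z.
    return np.exp(log_softmax(unnormalized))


# Toy 4-token vocabulary: the principle favors token 0, the plain prompt token 1.
logits_c = np.array([2.0, 0.5, 0.1, -1.0])
logits_u = np.array([0.5, 2.0, 0.1, -1.0])

pi_c = np.exp(log_softmax(logits_c))
p_aligned = principle_guided_step(logits_c, logits_u, beta=1.0)
p_weak = principle_guided_step(logits_c, logits_u, beta=1e6)
```

Under these assumptions, a smaller $\beta$ moves more probability mass toward tokens that the principle makes more likely, while a very large $\beta$ leaves the principle-conditioned distribution essentially unchanged.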

Abstract

With the rapidly expanding landscape of large language models, aligning model generations with human values and preferences is becoming increasingly important. Popular alignment methods, such as Reinforcement Learning from Human Feedback, have shown significant success in guiding models with greater control. However, these methods require considerable computational resources, which is inefficient, and substantial collection of training data to accommodate the diverse and pluralistic nature of human preferences, which is impractical. These limitations significantly constrain the scope and efficacy of both task-specific and general preference alignment methods. In this work, we introduce On-the-fly Preference Alignment via Principle-Guided Decoding (OPAD) to directly align model outputs with human preferences during inference, eliminating the need for fine-tuning. Our approach involves first curating a surrogate solution to an otherwise infeasible optimization problem and then designing a principle-guided reward function based on this surrogate. The final aligned policy is derived by maximizing this customized reward, which exploits the discrepancy between the constrained policy and its unconstrained counterpart. OPAD directly modifies the model's predictions during inference, ensuring principle adherence without incurring the computational overhead of retraining or fine-tuning. Experiments show that OPAD achieves competitive or superior performance in both general and personalized alignment tasks, demonstrating its efficiency and effectiveness compared to state-of-the-art baselines.
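The step from "maximizing this customized reward" to a closed-form aligned policy follows a standard argument for KL-regularized objectives. The sketch below is that textbook argument and not the paper's own derivation. As shorthand (an assumption for illustration), fix the context $(\mathbf{x}, c, \mathbf{y}_{<t})$, write $\pi(y)$ for the principle-conditioned next-token probability and $r(y)$ for the reward assigned to candidate token $y$.

```latex
\begin{aligned}
&\max_{p}\; \sum_{y} p(y)\, r(y) \;-\; \beta \sum_{y} p(y) \log \frac{p(y)}{\pi(y)}
\quad \text{s.t.} \quad \sum_{y} p(y) = 1 \\[4pt]
&\text{Stationarity with multiplier } \lambda:\quad
r(y) - \beta\left(\log \frac{p(y)}{\pi(y)} + 1\right) + \lambda = 0 \\[4pt]
&\Rightarrow\quad p(y) = \frac{1}{Z}\, \pi(y)\, \exp\!\left(\frac{r(y)}{\beta}\right),
\qquad Z = \sum_{y} \pi(y)\, \exp\!\left(\frac{r(y)}{\beta}\right)
\end{aligned}
```

Because the sum defining $Z$ runs over the vocabulary at a single decoding step, it is finite and can be computed exactly, which is consistent with the partition function being described as tractable.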

Paper Structure

This paper contains 23 sections, 1 theorem, 17 equations, 7 figures, 7 tables, 1 algorithm.

Key Result

Proposition 1

Given a target principle $c$, maximizing the KL divergence between the constrained policy ${\cal P}_c$ and the unconstrained policy ${\cal P}$ serves, under certain conditions, as a surrogate for minimizing the KL divergence between the true data distribution ${\cal P}_{\textrm{data}}$ and ${\cal P}_c$.

Figures (7)

  • Figure 1: Given a query and principle, OPAD offers a more poetic and eloquent response (befitting a charismatic poet), whereas prompting with the principle yields a direct answer that fails to follow the principle of acting as a poet.
  • Figure 2: OPAD overview. Given user query $\mathbf{x}$ and principle $c$, OPAD computes a principle-guided reward $r_{\theta}(\mathbf{x},\mathbf{y}_{<t},c)$ leveraging the divergence between the constrained probability distribution and its unconstrained counterpart. This reward quantifies the alignment between the current prediction and the principle $c$, and the final aligned policy $p_{\theta}$ is derived by maximizing this reward.
  • Figure 3: Direct comparison of OPAD with the baselines on personalized alignment tasks. Dark blue indicates the percentage of cases where OPAD wins over the baseline, as evaluated by GPT4-Turbo. Experiments show that OPAD substantially outperforms all the baselines, better addressing diverse user preferences.
  • Figure 4: Performance trend of OPAD with PP when model scales up.
  • Figure 5: Token distribution shift remains pronounced during decoding for OPAD.
  • ...and 2 more figures

Theorems & Definitions (1)

  • Proposition 1