Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts

Tingchen Fu; Yupeng Hou; Julian McAuley; Rui Yan

TL;DR

The paper tackles multi-objective alignment in LLMs, where conflicting objectives such as helpfulness, harmlessness, and humor admit no single optimal policy and retraining a separate model for each preference is costly. It proposes MCA, a gradient-free framework that builds an expert prompt and an adversarial prompt for each objective through iterative response augmentation, then applies contrastive decoding to realize a user-specified trade-off. The trade-off is encoded by a weight vector $\boldsymbol{w}$ on the probability simplex, so the target response is $\boldsymbol{y}^* = \arg\max_{\boldsymbol{y}} \sum_{i=1}^n w_i r_i(\boldsymbol{x}, \boldsymbol{y})$, where $r_i$ is the reward for objective $i$. Empirical results on Phi-2 and Llama-2-7b across HH-RLHF and SafeRLHF show that MCA expands the Pareto frontier without retraining and remains compatible with prior alignment methods. The approach offers a scalable path to personalized LLM alignment by shifting control to decoding-time prompts and rewards, while its dependence on reward models and model-size limitations suggest directions for future work and broader applicability.
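The weighted objective above can be illustrated with a small sketch. It assumes a finite set of candidate responses whose per-objective rewards are already known; the names `scalarized_reward`, `best_response`, and the reward values are hypothetical and serve only to show how the weights $w_i$ shift the preferred response. It is not the paper's decoding procedure.

```python
from typing import Mapping, Sequence


def scalarized_reward(weights: Sequence[float], rewards: Sequence[float]) -> float:
    """Return sum_i w_i * r_i for one response, with w on the probability simplex."""
    if len(weights) != len(rewards):
        raise ValueError("weights and rewards must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * r for w, r in zip(weights, rewards))


def best_response(weights: Sequence[float],
                  candidates: Mapping[str, Sequence[float]]) -> str:
    """Pick the candidate y that maximizes the weighted sum of its rewards."""
    return max(candidates, key=lambda y: scalarized_reward(weights, candidates[y]))


# Hypothetical (helpfulness, harmlessness) rewards for two candidate responses.
candidates = {
    "response_a": (0.9, 0.2),  # more helpful, less harmless
    "response_b": (0.4, 0.8),  # less helpful, more harmless
}

print(best_response((0.8, 0.2), candidates))  # helpfulness-heavy preference
print(best_response((0.2, 0.8), candidates))  # harmlessness-heavy preference
```

With the helpfulness-heavy weights the first candidate scores higher, and with the harmlessness-heavy weights the second does, which is the kind of trade-off the weight vector is meant to control.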

Abstract

The task of multi-objective alignment aims at balancing and controlling the different alignment objectives (e.g., helpfulness, harmlessness and honesty) of large language models to meet the personalized requirements of different users. However, previous methods tend to train multiple models to deal with various user preferences, with the number of trained models growing linearly with the number of alignment objectives and the number of different preferences. Meanwhile, existing methods generally extend poorly and require significant re-training for each new alignment objective considered. To address these limitations, we propose MCA (Multi-objective Contrastive Alignment), which constructs an expert prompt and an adversarial prompt for each objective to contrast at decoding time and balances the objectives by combining the contrasts. Our approach is verified to be superior to previous methods in obtaining a well-distributed Pareto front among different alignment objectives.
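As a rough illustration of "combining the contrast", the sketch below adds a weighted difference between expert-prompt and adversarial-prompt log-probabilities to the base next-token log-probabilities. The exact combination rule is not given in this summary, so the additive form, the `alpha` scaling factor, and the function names are all assumptions made for illustration. The logits are toy values over a three-token vocabulary.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def combine_contrasts(base_logits: Sequence[float],
                      expert_logits: Sequence[Sequence[float]],
                      adversarial_logits: Sequence[Sequence[float]],
                      weights: Sequence[float],
                      alpha: float = 1.0) -> List[float]:
    """Assumed rule: base log-probs + alpha * sum_i w_i * (expert_i - adversarial_i)."""
    scores = log_softmax(base_logits)
    for w, exp_l, adv_l in zip(weights, expert_logits, adversarial_logits):
        expert_lp = log_softmax(exp_l)
        adversarial_lp = log_softmax(adv_l)
        for t in range(len(scores)):
            scores[t] += alpha * w * (expert_lp[t] - adversarial_lp[t])
    return scores


def greedy_token(scores: Sequence[float]) -> int:
    """Index of the highest-scoring token."""
    return max(range(len(scores)), key=scores.__getitem__)


# Toy next-token logits under the plain prompt and under each objective's
# expert and adversarial prompts (two objectives, three tokens).
base = [0.0, 0.0, 0.0]
expert = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
adversarial = [[0.0, 0.0, 2.0], [2.0, 0.0, 0.0]]

print(greedy_token(combine_contrasts(base, expert, adversarial, (1.0, 0.0))))
print(greedy_token(combine_contrasts(base, expert, adversarial, (0.0, 1.0))))
```

In this toy setup, putting all weight on the first objective favors token 0 and putting all weight on the second favors token 2, with no gradient updates involved. A real implementation would obtain the three sets of logits from forward passes of the same model under different prompts.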

Paper Structure (32 sections, 7 equations, 16 figures, 8 tables)

Introduction
Related Work
Language Model Alignment.
Multi-objective Alignment.
Contrastive Decoding.
Method
Problem Formulation
Iterative Prompt Construction
Preference-Aware Multiple Contrastive Decoding
Discussion.
Experiments
Backbone.
Dataset.
Reward Model.
Single-Objective Alignment
...and 17 more sections

Figures (16)

Figure 1: The correlation between helpfulness score and harmlessness score on Phi-2 generated responses on HH-RLHF (left) and SafeRLHF (right). The scores are given by objective-specific reward models.
Figure 2: The workflow of proposed MCA is composed of two major steps: iterative prompt construction and preference-aware multiple contrastive decoding.
Figure 3: The Pareto front of Phi-2 evaluated on HH-RLHF and SafeRLHF when combined with MCA.
Figure 4: The Pareto front of Llama-2-7b evaluated on HH-RLHF and SafeRLHF when combined with MCA.
Figure 5: The performance of Phi-2 in three alignment dimensions on HH-RLHF (left) and SafeRLHF (right) when combined with MCA. The reward values in three dimensions are normalized within $[0,1]$.
...and 11 more figures
