Unlocking Decoding-time Controllability: Gradient-Free Multi-Objective Alignment with Contrastive Prompts
Tingchen Fu, Yupeng Hou, Julian McAuley, Rui Yan
TL;DR
The paper tackles multi-objective alignment in LLMs, where conflicting objectives such as helpfulness, harmlessness, and humor admit no single optimal policy and retraining a separate model for each preference is costly. It proposes MCA, a gradient-free framework that constructs per-objective expert and adversarial prompts through iterative response augmentation and applies contrastive decoding to realize user-specified trade-offs. A trade-off is encoded by weights $w_1, \dots, w_n$ on the simplex, and the target response is $\boldsymbol{y}^* = \arg\max_{\boldsymbol{y}} \sum_{i=1}^n w_i r_i(\boldsymbol{x}, \boldsymbol{y})$, where $r_i$ is the reward for objective $i$. Empirical results with Phi-2 and Llama-2-7b on HH-RLHF and SafeRLHF show that MCA expands the Pareto frontier without retraining and remains compatible with prior alignment methods. The approach offers a scalable path to personalized LLM alignment by shifting control to decoding-time prompts and rewards; its dependence on reward models and the limited model sizes tested point to directions for future work.
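To make the weighted objective concrete, the sketch below picks the best response from a fixed candidate list under a given weight vector. It is only an illustration of the scalarization $\sum_i w_i r_i(\boldsymbol{x}, \boldsymbol{y})$, not MCA itself: the function `pick_best` and the two toy reward functions are hypothetical stand-ins for learned reward models, and the argmax is taken over a handful of candidates rather than over all possible responses.

```python
from typing import Callable, Sequence

RewardFn = Callable[[str, str], float]  # maps (prompt x, response y) to a scalar reward


def pick_best(x: str, candidates: Sequence[str],
              reward_fns: Sequence[RewardFn], weights: Sequence[float]) -> str:
    """Return the candidate y maximizing sum_i w_i * r_i(x, y)."""
    if len(weights) != len(reward_fns):
        raise ValueError("need one weight per reward function")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must lie on the probability simplex")
    return max(candidates,
               key=lambda y: sum(w * r(x, y) for w, r in zip(weights, reward_fns)))


# Toy rewards (assumptions for illustration only, not real reward models).
def toy_helpfulness(x: str, y: str) -> float:
    return len(y.split()) / 10.0  # longer answers count as more helpful


def toy_harmlessness(x: str, y: str) -> float:
    return 0.0 if "insult" in y else 1.0


candidates = ["short safe reply", "a much longer reply with an insult inside it"]
rewards = [toy_helpfulness, toy_harmlessness]

helpful_first = pick_best("prompt", candidates, rewards, weights=[0.9, 0.1])
harmless_first = pick_best("prompt", candidates, rewards, weights=[0.2, 0.8])
```

Changing only the weight vector changes which candidate wins, which is the kind of user-controlled trade-off the TL;DR describes.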
Abstract
The task of multi-objective alignment aims at balancing and controlling the different alignment objectives (e.g., helpfulness, harmlessness and honesty) of large language models to meet the personalized requirements of different users. However, previous methods tend to train multiple models to deal with various user preferences, with the number of trained models growing linearly with the number of alignment objectives and the number of different preferences. Meanwhile, existing methods generally extend poorly and require significant re-training for each new alignment objective. To address these limitations, we propose MCA (Multi-objective Contrastive Alignment), which constructs an expert prompt and an adversarial prompt for each objective, contrasts them at decoding time, and balances the objectives by combining the resulting contrasts. Our approach is verified to be superior to previous methods in obtaining a well-distributed Pareto front among different alignment objectives.
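The decoding-time contrast can be sketched as follows. The abstract states only that an expert prompt and an adversarial prompt are contrasted per objective and that the contrasts are combined. The additive log-probability rule, the scale `alpha`, and the function name `contrastive_next_token_logprobs` below are assumptions chosen for illustration, and the paper's exact formula may differ. The logits are toy numbers standing in for a language model's next-token outputs under each prompt.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a small vocabulary."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]


def contrastive_next_token_logprobs(
    base_logits: Sequence[float],
    expert_logits: Sequence[Sequence[float]],       # one row per objective
    adversarial_logits: Sequence[Sequence[float]],  # one row per objective
    weights: Sequence[float],
    alpha: float = 1.0,                             # hypothetical contrast strength
) -> List[float]:
    """Assumed rule: base + alpha * sum_i w_i * (log p_expert_i - log p_adversarial_i)."""
    combined = log_softmax(base_logits)
    for w, e, a in zip(weights, expert_logits, adversarial_logits):
        log_e, log_a = log_softmax(e), log_softmax(a)
        for t in range(len(combined)):
            combined[t] += alpha * w * (log_e[t] - log_a[t])
    return log_softmax(combined)  # renormalize to a valid distribution


# Toy 3-token vocabulary, two objectives (illustrative numbers only).
base = [0.0, 0.0, 0.0]
expert = [[2.0, 0.0, 0.0], [0.0, 0.0, 2.0]]
adversarial = [[0.0, 2.0, 0.0], [2.0, 0.0, 0.0]]

lp_obj1 = contrastive_next_token_logprobs(base, expert, adversarial, weights=[1.0, 0.0])
lp_obj2 = contrastive_next_token_logprobs(base, expert, adversarial, weights=[0.0, 1.0])
best_obj1 = max(range(3), key=lambda t: lp_obj1[t])
best_obj2 = max(range(3), key=lambda t: lp_obj2[t])
```

Under this assumed rule no gradient step is taken: only the weights change between the two calls, and the preferred token shifts with them.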
