Panacea: Pareto Alignment via Preference Adaptation for LLMs

Yifan Zhong, Chengdong Ma, Xiaoyuan Zhang, Ziran Yang, Haojun Chen, Qingfu Zhang, Siyuan Qi, Yaodong Yang

TL;DR

Panacea reframes LLM alignment as multi-dimensional preference optimization (MDPO) and trains a single model to recover the entire Pareto front (PF) across many human preference dimensions. It embeds a low-dimensional preference vector into the singular values of every SVD-LoRA layer, so the model adapts online and Pareto-optimally to a new preference vector with no per-vector fine-tuning. The authors prove that optimizing common aggregations of the per-dimension objectives recovers the full PF under mild assumptions, and they demonstrate scalability up to ten dimensions with convex, well-distributed fronts that outperform baselines. The approach offers fine-grained, efficient, and controllable alignment for LLMs across diverse user needs.

Abstract

Current methods for large language model alignment typically use scalar human preference labels. However, this convention tends to oversimplify the multi-dimensional and heterogeneous nature of human preferences, leading to reduced expressivity and even misalignment. This paper presents Panacea, an innovative approach that reframes alignment as a multi-dimensional preference optimization problem. Panacea trains a single model capable of adapting online and Pareto-optimally to diverse sets of preferences without the need for further tuning. A major challenge here is using a low-dimensional preference vector to guide the model's behavior, despite it being governed by an overwhelmingly large number of parameters. To address this, Panacea is designed to use singular value decomposition (SVD)-based low-rank adaptation, which allows the preference vector to be simply injected online as singular values. Theoretically, we prove that Panacea recovers the entire Pareto front with common loss aggregation methods under mild conditions. Moreover, our experiments demonstrate, for the first time, the feasibility of aligning a single LLM to represent an exponentially vast spectrum of human preferences through various optimization methods. Our work marks a step forward in effectively and efficiently aligning models to diverse and intricate human preferences in a controllable and Pareto-optimal manner.
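
The abstract's central mechanism, injecting the preference vector as singular values of an SVD-based low-rank adapter, can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the class name `PreferenceSVDLoRA`, the layout of the diagonal (learned singular values followed by the scaled preference entries), the single scalar `scale`, and the initialization are all hypothetical choices made for illustration.

```python
import numpy as np


class PreferenceSVDLoRA:
    """Toy linear layer with W = W0 + U @ diag([sigma, scale * lam]) @ V.T.

    Assumption: the last n_prefs diagonal entries carry the preference
    vector; the first `rank` entries are ordinary learned singular values.
    """

    def __init__(self, d_in, d_out, rank, n_prefs, seed=0):
        rng = np.random.default_rng(seed)
        k = rank + n_prefs
        self.W0 = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
        self.U = rng.normal(size=(d_out, k)) * 0.1   # left factors (trainable)
        self.V = rng.normal(size=(d_in, k)) * 0.1    # right factors (trainable)
        self.sigma = np.zeros(rank)                  # preference-agnostic singular values
        self.scale = 1.0                             # learnable magnitude-matching factor
        self.n_prefs = n_prefs

    def weight(self, lam):
        """Return the adapted weight for a preference vector on the simplex."""
        lam = np.asarray(lam, dtype=float)
        if lam.shape != (self.n_prefs,):
            raise ValueError("preference vector has the wrong dimension")
        if np.any(lam < 0) or not np.isclose(lam.sum(), 1.0):
            raise ValueError("preference vector must lie on the simplex")
        diag = np.concatenate([self.sigma, self.scale * lam])
        # Scaling the columns of U by diag equals U @ np.diag(diag).
        return self.W0 + (self.U * diag) @ self.V.T

    def forward(self, x, lam):
        """Apply the adapted layer to a batch x of shape (batch, d_in)."""
        return x @ self.weight(lam).T


layer = PreferenceSVDLoRA(d_in=8, d_out=4, rank=2, n_prefs=2)
x = np.ones((3, 8))
y_first = layer.forward(x, [1.0, 0.0])   # all weight on the first dimension
y_second = layer.forward(x, [0.0, 1.0])  # all weight on the second dimension
```

Because only the diagonal changes with the preference vector, switching preferences at inference time requires no retraining in this sketch; the same `U`, `V`, and `W0` are reused.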

Paper Structure

This paper contains 28 sections, 6 theorems, 22 equations, 16 figures, 4 tables.

Key Result

Theorem 4.1

Panacea recovers the entire Pareto front for both the linear scalarization (LS) and Tchebycheff (Tche) aggregation functions under the following assumptions: 1. Panacea with SVD-LoRA has sufficient representation capability for all preferences ${\bm{\lambda}} \in \Delta_m$. Specifically, for any preference vector ${\bm{\lambda}}$ [...]
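
The two aggregation functions named in the theorem, LS (linear scalarization) and Tche (Tchebycheff), can be sketched as follows. The paper's exact equations are not reproduced in this excerpt, so the forms below are the maximization-style versions commonly used in the multi-objective optimization literature and should be read as an assumption. The function names and the `ideal` reference point (a vector that upper-bounds each objective) are hypothetical.

```python
def linear_scalarization(objectives, lam):
    """Weighted sum of per-dimension objectives: sum_i lam_i * J_i."""
    if len(objectives) != len(lam):
        raise ValueError("objectives and lam must have the same length")
    return sum(l * j for l, j in zip(lam, objectives))


def tchebycheff(objectives, lam, ideal):
    """Worst weighted shortfall from an ideal point: min_i lam_i * (J_i - z_i).

    Assumes every z_i is at least as large as the best attainable J_i, so each
    term is non-positive and maximizing the minimum shrinks the largest
    weighted gap to the ideal point.
    """
    if not (len(objectives) == len(lam) == len(ideal)):
        raise ValueError("objectives, lam, and ideal must have the same length")
    return min(l * (j - z) for l, j, z in zip(lam, objectives, ideal))


rewards = [1.0, 3.0]  # hypothetical per-dimension objective values
ls_value = linear_scalarization(rewards, [0.25, 0.75])
tche_value = tchebycheff(rewards, [0.5, 0.5], ideal=[4.0, 4.0])
```

In this sketch both functions are meant to be maximized over model parameters for a fixed preference vector; LS trades objectives off linearly, while Tche focuses on the dimension with the largest weighted gap.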

Figures (16)

  • Figure 1: Comparison of the predominant single-objective alignment and our multi-dimensional alignment. For the two responses to a prompt, labelers agree on the preferable one in each preference dimension, but conflict when assigning a synthesized scalar label denoting which is "better". This arises due to the inherently different preference weights held by labelers, a common case in reality. Performing single-objective optimization on the potentially conflicting scalar-label dataset (left) could lead to a dominated solution and misalignment. By contrast, our method, Panacea, leverages multi-dimensional preference optimization (right) on the consistent multi-dimensional dataset and learns the entire Pareto front (PF), thereby aligning with diverse and complex human preferences.
  • Figure 2: Panacea embeds the preference vector into singular values of each SVD-LoRA layer and scales it with learnable factors to match the magnitudes. During learning, for each data batch, we randomly sample a preference vector from the preference simplex and train the embedded model with various optimization procedures and loss aggregation methods. In the inference stage, the model adapts online to the user-specified preference vector and exhibits Pareto alignment in its responses.
  • Figure 3: Algorithm performance on HH. Baseline methods (RS and DPS) require training a separate model for each preference dimension/vector, whereas Panacea learns a single adaptable model. Left: Panacea is significantly better than RS and even outperforms DPS, showing its superiority in learning PF while being more efficient. Middle: on Llama2-ft across different seeds, Panacea again consistently outperforms RS, and its fronts exhibit smooth convex shapes that correspond with theory. Right: with DPO, Panacea using both LS and Tche aggregation learns better fronts than RS.
  • Figure 4: Responses of the model to the same user prompt with two extreme preference vectors. Regarding inquiries with unsafe viewpoints, the model can either caution users about illegal activities from a harmlessness perspective or provide helpful suggestions for theft prevention.
  • Figure 5: Learned fronts of Panacea (red) and RS (blue) on the HHC problem with Llama2-ft, RLHF, and LS aggregation. Panacea learns a better and more evenly distributed front, while the solutions of RS cluster in a corner. This suggests Panacea provides fine-grained solutions to diverse human preferences.
  • ...and 11 more figures
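
The training procedure described in the Figure 2 caption draws a fresh preference vector from the preference simplex for each data batch. A minimal sketch of that sampling step is given below. The caption does not state which distribution is used, so uniform sampling over the simplex is an assumption here, and `sample_preference` and `batches_with_preferences` are hypothetical helper names.

```python
import random


def sample_preference(m, rng):
    """Draw one preference vector uniformly from the (m-1)-simplex."""
    if m < 1:
        raise ValueError("need at least one preference dimension")
    # Normalizing i.i.d. Exp(1) draws yields a Dirichlet(1, ..., 1) sample,
    # which is the uniform distribution on the simplex.
    draws = [rng.expovariate(1.0) for _ in range(m)]
    total = sum(draws)
    return [d / total for d in draws]


def batches_with_preferences(batches, m, seed=0):
    """Pair every batch with its own freshly sampled preference vector."""
    rng = random.Random(seed)
    for batch in batches:
        yield batch, sample_preference(m, rng)


pairs = list(batches_with_preferences(["batch0", "batch1", "batch2"], m=3))
```

Each `(batch, lam)` pair would then be passed to whatever loss aggregation and optimization procedure is in use, with `lam` embedded into the model before the forward pass.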

Theorems & Definitions (13)

  • Definition 3.1: Pareto optimality
  • Theorem 4.1
  • Lemma A.1: Extension of Lemma 2 in Rafailov et al. (2023) for multiple reward models
  • Remark A.2
  • proof
  • Theorem B.4
  • proof
  • Corollary C.2
  • Lemma C.3: Convex space Lemma, adapted from Hu et al. (2024), Eq. 13
  • Definition C.4: Convex Coverage Set (CCS), adapted from Roijers et al. (2015), Def. 9
  • ...and 3 more
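
Definition 3.1 in the list above concerns Pareto optimality. Its wording is not included in this excerpt, so the sketch below uses the standard maximization convention as an assumption: a point dominates another if it is at least as good in every dimension and strictly better in at least one, and the Pareto front is the set of non-dominated points. The function names and sample points are hypothetical.

```python
def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    if len(a) != len(b):
        raise ValueError("points must have the same dimension")
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the non-dominated points, preserving input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (helpfulness, harmlessness) scores for five candidate policies.
candidates = [(1, 3), (2, 2), (3, 1), (1, 1), (2, 1)]
front = pareto_front(candidates)
```

The filter is quadratic in the number of points, which is adequate for the handful of evaluated preference vectors typically plotted in a front.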