One Model for All: Multi-Objective Controllable Language Models

Qiang He, Yucheng Yang, Tianyi Zhou, Meng Fang, Mykola Pechenizkiy, Setareh Maghsudi

Abstract

Aligning large language models (LLMs) with human preferences is critical for enhancing LLMs' safety, helpfulness, humor, faithfulness, etc. Current reinforcement learning from human feedback (RLHF) mainly optimizes a fixed reward learned from average human ratings, which may weaken adaptability and controllability with respect to varying preferences. Creating personalized LLMs, however, requires aligning LLMs with individual human preferences, which is non-trivial due to the scarce data per user and the diversity of user preferences in multi-objective trade-offs, ranging from emphasizing empathy in some contexts to demanding efficiency and precision in others. Can we train one LLM to produce personalized outputs across different user preferences on the Pareto front? In this paper, we introduce Multi-Objective Control (MOC), which trains a single LLM to directly generate responses in the preference-defined regions of the Pareto front. Our approach introduces multi-objective optimization (MOO) principles into RLHF to train an LLM as a preference-conditioned policy network. We improve the computational efficiency of MOC by applying MOO at the policy level, which enables fine-tuning a 7B-parameter model on a single A6000 GPU. Extensive experiments demonstrate the advantages of MOC over baselines in three aspects: (i) controllability of LLM outputs w.r.t. user preferences on the trade-off among multiple rewards; (ii) quality and diversity of LLM outputs, measured by the hyper-volume of the achieved solutions; and (iii) generalization to unseen preferences. These results highlight MOC's potential for real-world applications requiring scalable and customizable LLMs.
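
MOC applies MOO at the policy level, and for two objectives the min-norm direction at the core of MGDA-style solvers has a simple closed form. The NumPy sketch below illustrates that principle only; `min_norm_weight` and the toy gradients are illustrative names and values, not the paper's implementation.

```python
import numpy as np

def min_norm_weight(g1: np.ndarray, g2: np.ndarray) -> float:
    """Closed-form min-norm combination of two gradients (MGDA-style).

    Returns w* minimizing ||w * g1 + (1 - w) * g2||^2 over w in [0, 1];
    if nonzero, the combined vector is a common update direction that
    improves both objectives.
    """
    diff = g1 - g2
    denom = float(diff @ diff)
    if denom == 0.0:  # gradients coincide; any convex weight works
        return 0.5
    w = float((g2 - g1) @ g2) / denom
    return float(np.clip(w, 0.0, 1.0))

# Toy example with two conflicting policy gradients (illustrative values).
g_helpful = np.array([1.0, 0.2])
g_humor = np.array([0.1, 1.0])
w = min_norm_weight(g_helpful, g_humor)
update_direction = w * g_helpful + (1.0 - w) * g_humor
```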

Paper Structure

This paper contains 43 sections, 2 theorems, 33 equations, 11 figures, 18 tables, 2 algorithms.

Key Result

Theorem 1

Let $z(\theta) = \frac{\pi(y|x;\theta)}{\pi_{\text{old}}(y|x)}$ denote the probability ratio in the PPO objective, and let $\epsilon$ be the clipping hyper-parameter of PPO (Schulman et al., 2017). The theorem upper-bounds the min-norm objective of MOC in terms of $\epsilon$ and the advantage estimates $\hat{A}_j$ of the individual objectives.
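
For concreteness, here is a minimal PyTorch sketch of the per-objective clipped PPO surrogate in which the quantities $z(\theta)$, $\epsilon$, and $\hat{A}_j$ appear; this is the standard PPO form, assumed here for illustration rather than taken from the paper's code.

```python
import torch

def clipped_ppo_surrogate(logp_new: torch.Tensor,
                          logp_old: torch.Tensor,
                          advantage: torch.Tensor,
                          eps: float = 0.2) -> torch.Tensor:
    """Standard clipped PPO surrogate for a single objective.

    z(theta) = pi(y|x; theta) / pi_old(y|x) is the probability ratio and
    eps is the clipping hyper-parameter. Clipping confines the effective
    ratio to [1 - eps, 1 + eps], which is what makes a bound of the kind
    stated in Theorem 1 possible.
    """
    z = torch.exp(logp_new - logp_old)                          # probability ratio z(theta)
    unclipped = z * advantage
    clipped = torch.clamp(z, 1.0 - eps, 1.0 + eps) * advantage
    return torch.min(unclipped, clipped).mean()                 # maximize this quantity
```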

Figures (11)

  • Figure 1: Solutions of MOC and Linear PPO on the fishwood task and the Pareto front (black line). MOC demonstrates advantages in both MOO (solutions lie on the Pareto front) and multi-objective control (its solutions align closely with their corresponding preference vectors, shown as the colored dashed rays). The single model trained by MOC can handle diverse preference vectors. In contrast, Linear PPO optimizes a linear scalarization of the objectives and fails to follow the preference vectors, with solutions dominated by one objective. The legend "Preference" indicates the specific weight value assigned to Reward 1 (Wood). Linear PPO is implemented by optimizing PPO w.r.t. a scalarized reward $R_{\text{lin}}=w\,R^{\text{wood}}+(1-w)\,R^{\text{fish}}$ (with $w\in[0,1]$). In our experiments, we train Linear PPO runs with weights $[0.1,0.9],[0.2,0.8],\ldots,[0.8,0.2],[0.9,0.1]$.
  • Figure 2: Controllability comparison on the Pareto front. MOC demonstrates superior controllability: the achieved reward values of its solutions are ordered consistently with their preference weights. In comparison, the baselines exhibit less stable behavior and weaker alignment with the specified preferences. MOC also achieves higher-quality solutions, particularly in the Humor & Helpful alignment. MOC achieves the best overall performance, supported by these results and by the Kendall's tau, hyper-volume, and entropy comparisons reported in the paper's tables. Each point represents the reward achieved across multiple instances, each with a different input preference vector. Each point's preference weight for the x-axis reward is the numerical label on its marker.
  • Figure 3: Illustration of the hyper-volume concept. The hyper-volume measures the size of the objective space dominated by a set of solutions in multi-objective optimization. Larger hyper-volumes indicate better convergence and diversity of the Pareto front (a minimal computation is sketched after this list).
  • Figure 4: Generalization to preference vectors held out from training. LLMs trained with MOC and RiC are compared on four random sets of unseen preference vectors. Each column corresponds to a different set of unseen preference vectors, and each row represents a different pair of reward settings. MOC solutions dominate the RiC solutions in most cases. MOC's rewards align with the new preference vectors, and the outputs under different preferences are diverse in the reward space. This suggests that MOC generalizes to unseen preferences and achieves diverse trade-offs on the Pareto front. The size of each point indicates the standard deviation in rewards. The numerical labels indicate the preference weights (multiplied by 100) for the reward on the x-axis, enhancing visual clarity.
  • Figure 5: Visualization of four groups of randomly sampled, unseen preference vectors. Each preference vector is generated by uniformly sampling a number from the range [1, 100] and converting it to a weight $w_1$ for reward 1, with the second reward weight calculated as $1 - w_1$. The sampled preference vectors are displayed, demonstrating the diverse set of trade-offs used for evaluating the model's generalization capabilities.
  • ...and 6 more figures
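
Since Figures 2 and 3 lean on the hyper-volume metric, the following sketch shows how it can be computed for two maximized objectives with a sweep over the solution set. `hypervolume_2d` and the reference-point convention are assumptions for illustration, not the paper's evaluation code.

```python
import numpy as np

def hypervolume_2d(points: np.ndarray, ref: np.ndarray) -> float:
    """Hyper-volume (area) dominated by `points` relative to the
    reference point `ref`, for a two-objective maximization problem.
    """
    # Keep only solutions that strictly dominate the reference point.
    pts = points[np.all(points > ref, axis=1)]
    if len(pts) == 0:
        return 0.0
    # Sweep from the largest first objective downward, accumulating the
    # rectangular strip each non-dominated solution adds.
    pts = pts[np.argsort(-pts[:, 0])]
    hv, prev_y = 0.0, float(ref[1])
    for x, y in pts:
        if y > prev_y:  # dominated points add no area and are skipped
            hv += (x - ref[0]) * (y - prev_y)
            prev_y = float(y)
    return float(hv)

# Example: two trade-off solutions w.r.t. reference point (0, 0).
solutions = np.array([[2.0, 1.0], [1.0, 2.0]])
print(hypervolume_2d(solutions, np.array([0.0, 0.0])))  # -> 3.0
```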

Theorems & Definitions (5)

  • Theorem 1
  • Theorem 1
  • Proof
  • Definition 1
  • Definition 2