PMoL: Parameter Efficient MoE for Preference Mixing of LLM Alignment

Dongxu Liu, Bing Xu, Yinzhuo Chen, Bufan Xu, Wenpeng Lu, Muyun Yang, Tiejun Zhao

TL;DR

We propose Preference Mixture of LoRAs (PMoL), which combines Mixture of Experts (MoE) with Low-Rank Adaptation (LoRA) and achieves better preference alignment at lower training cost.

Abstract

Reinforcement Learning from Human Feedback (RLHF) has proven to be an effective method for aligning large language models (LLMs) with human preferences and is widely used in LLM post-training. However, RLHF struggles to handle multiple competing preferences, which reduces how well LLMs align with human preferences. To address this issue, we propose Preference Mixture of LoRAs (PMoL), which approaches the problem from the perspective of model architecture and can mix any number of preferences. PMoL combines Mixture of Experts (MoE) with Low-Rank Adaptation (LoRA). We apply this architecture to preference alignment, where it yields a significant performance improvement. An expert group soft loss gives the MoE the ability to mix preferences. Comprehensive evaluation with a reward model and GPT-4o shows that PMoL has superior preference mixing capabilities compared to baseline methods. PMoL achieves better preference alignment at lower training cost.

Paper Structure

This paper contains 36 sections, 16 equations, 10 figures, 11 tables.

Figures (10)

  • Figure 1: Results of full-parameter training (denoted Full, with and without ZeRO-2), LoRA, and PMoL on 30,000 training examples for 2 epochs. The horizontal axis is the product of GPU memory usage (GB) and time (hours). The vertical axis is the average score. Training uses 8 GPUs.
  • Figure 2: Overview of PMoL. PMoL is an adaptor with an MoE framework added to the Transformer block. The experts are divided into multiple groups, and each group focuses on one preference. An additional empty expert allows the LLM to reduce its reliance on the experts. During training, the pre-trained parameters are frozen, and only the LoRA experts and the router are updated.
  • Figure 3: The cost of generating 100,000 tokens for full-parameter fine-tuning, LoRA, and PMoL with and without parallelization. The horizontal axis is memory usage (GB) multiplied by time (hours).
  • Figure 4: The difficulty of mixing preferences with different Gap data. represents preference data. The top half is Chosen Response, and the bottom half is Reject Response. represents the same parts among different preferences. represents the competing parts between different preferences. Whenever there is a competing part, some preferences will inevitably be misclassified by latent reward.
  • Figure 5: Expert weights when using helpfulness (left), harmlessness (middle), and empathy (right) preference data. We calculate the total weights for experts in each preference. The horizontal axis represents the number of model layers, and the vertical axis denotes the average expert weights.
  • ...and 5 more figures
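
The layer described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes dense softmax routing over all experts, an empty expert that contributes a zero update, and random initialization of every matrix. The class name `MixtureOfLoRALayer` and all parameter names are hypothetical, and the expert group soft loss is not shown. The `group_weights` helper sums router weights per expert group, in the spirit of the per-preference totals plotted in Figure 5.

```python
import math
import random


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(logits):
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class MixtureOfLoRALayer:
    """Frozen linear layer plus a routed mixture of LoRA experts (illustrative)."""

    def __init__(self, d_in, d_out, rank, n_groups, experts_per_group, seed=0):
        rng = random.Random(seed)
        self.n_groups = n_groups
        self.n_experts = n_groups * experts_per_group
        # Assumption: experts are assigned to preference groups contiguously.
        self.group_of = [i // experts_per_group for i in range(self.n_experts)]
        # Frozen pre-trained weight (random here, for illustration only).
        self.W = rand_matrix(d_out, d_in, rng)
        # Trainable LoRA factors per expert: the update is B_i @ A_i.
        # LoRA conventionally starts B at zero; random values are used here
        # only so the demo produces a visible update.
        self.A = [rand_matrix(rank, d_in, rng) for _ in range(self.n_experts)]
        self.B = [rand_matrix(d_out, rank, rng) for _ in range(self.n_experts)]
        # Trainable router: one logit per expert plus one for the empty expert.
        self.router = rand_matrix(self.n_experts + 1, d_in, rng)

    def gates(self, x):
        return softmax(matvec(self.router, x))

    def forward(self, x):
        g = self.gates(x)
        y = matvec(self.W, x)
        for i in range(self.n_experts):
            delta = matvec(self.B[i], matvec(self.A[i], x))
            y = [yj + g[i] * dj for yj, dj in zip(y, delta)]
        # The empty expert (last gate) adds nothing, so any weight routed to
        # it shrinks the total contribution of the LoRA experts.
        return y

    def group_weights(self, x):
        """Return (total gate weight per group, gate weight of the empty expert)."""
        g = self.gates(x)
        totals = [0.0] * self.n_groups
        for i, gi in enumerate(g[:-1]):
            totals[self.group_of[i]] += gi
        return totals, g[-1]
```

For example, `MixtureOfLoRALayer(4, 3, 2, n_groups=3, experts_per_group=2)` builds three groups of two experts. The group totals and the empty-expert weight always sum to 1, and when the router sends nearly all weight to the empty expert the output reduces to that of the frozen layer alone.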