DEFT: Distribution-guided Efficient Fine-Tuning for Human Alignment

Liang Zhu, Feiteng Fang, Yuelin Bai, Longze Chen, Zhexiang Zhang, Minghuan Tan, Min Yang

Abstract

Reinforcement Learning from Human Feedback (RLHF), using algorithms such as Proximal Policy Optimization (PPO), aligns Large Language Models (LLMs) with human values but is costly and unstable. Alternatives have been proposed that replace PPO or integrate Supervised Fine-Tuning (SFT) with contrastive learning for direct fine-tuning and value alignment. However, these methods still require large amounts of data to learn preferences and may weaken the generalization ability of LLMs. To further enhance alignment efficiency and performance while mitigating the loss of generalization ability, this paper introduces Distribution-guided Efficient Fine-Tuning (DEFT), an efficient alignment framework that combines data filtering with distributional guidance by computing a differential distribution reward from the output distribution of the language model and the discrepancy distribution of the preference data. A small yet high-quality subset is filtered from the raw data using the differential distribution reward, and the reward is then incorporated into existing alignment methods to guide the model's output distribution. Experimental results demonstrate that methods enhanced by DEFT outperform the original methods in both alignment capability and generalization ability, with significantly reduced training time.
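To make the mechanism concrete, the sketch below illustrates one plausible reading of the differential distribution reward (the figure captions further below describe the same pipeline): positive and negative word-frequency distributions are built from the chosen and rejected responses of the preference data, their difference forms the differential distribution, and the model's output distribution is scored against it. The function names and the inner-product scoring are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the differential distribution reward (names and the
# inner-product scoring are assumptions, not the authors' implementation).
from collections import Counter
from typing import Dict, List


def token_frequency(responses: List[List[str]]) -> Dict[str, float]:
    """Normalized word-frequency distribution over tokenized responses."""
    counts = Counter(tok for resp in responses for tok in resp)
    total = sum(counts.values()) or 1
    return {tok: c / total for tok, c in counts.items()}


def differential_distribution(chosen: List[List[str]],
                              rejected: List[List[str]]) -> Dict[str, float]:
    """Positive minus negative distribution: tokens shared by both sides cancel,
    while tokens carrying preference information are amplified."""
    pos = token_frequency(chosen)
    neg = token_frequency(rejected)
    vocab = set(pos) | set(neg)
    return {tok: pos.get(tok, 0.0) - neg.get(tok, 0.0) for tok in vocab}


def distribution_reward(output_dist: Dict[str, float],
                        diff_dist: Dict[str, float]) -> float:
    """Score the model's output distribution against the differential
    distribution; an inner product is used here purely for illustration."""
    return sum(output_dist.get(tok, 0.0) * w for tok, w in diff_dist.items())


if __name__ == "__main__":
    chosen = [["please", "stay", "polite"], ["thanks", "for", "asking"]]
    rejected = [["just", "do", "it"], ["thanks", "whatever"]]
    diff = differential_distribution(chosen, rejected)
    # Toy "model output distribution" over a few tokens of the shared vocabulary.
    output = {"please": 0.3, "thanks": 0.4, "whatever": 0.3}
    print(distribution_reward(output, diff))
```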

Paper Structure

This paper contains 32 sections, 10 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: The positive and negative distributions can be obtained by calculating word frequencies from the tokenized preference data. Subtracting the negative distribution from the positive one amplifies the information most closely aligned with, and most divergent from, the preferences, while cancelling out redundant information. The distribution reward, computed from the differential distribution and the model's output distribution, is used both for selecting the high-quality subset and for guiding the distribution during training.
  • Figure 2: Data filtration is achieved through the pre-computed $\mathcal{R}_{Q}$: responses that demand highly specific preferences yield a lower $\mathcal{R}_{Q}$, while those unrelated to preferences receive a higher $\mathcal{R}_{Q}$, which facilitates extracting a dataset that carries maximal preference information (see the sketch after this list).
  • Figure 3: Reference answers augmented by ChatGPT contribute to a more reasonable calculation of BLEU and BARTScore.
  • Figure 4: In both the Harmless and Helpful aspects of human evaluations, the DEFT series demonstrates a higher win rate compared to the original method.
  • Figure 5: Changes of $\mathcal{R}_{Q}$ during the training process with and without the involvement of $\mathcal{R}_{Q}$ updates (Left). Performance on the test set across varying data volumes (Right).
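A minimal, assumed version of the filtering step described in Figure 2 might pre-compute $\mathcal{R}_{Q}$ for every training example with a reward like the one sketched above and keep only the examples that carry the strongest preference signal. Interpreting "strongest signal" as the lowest $\mathcal{R}_{Q}$ values follows the caption's wording and is an assumption about the exact selection rule, as is the keep_fraction parameter.

```python
# Assumed filtering step (cf. Figure 2): keep the examples whose pre-computed
# R_Q indicates the strongest preference signal. Following the caption, lower
# R_Q is read as "more preference-specific"; the rule and keep_fraction are
# assumptions for illustration, not the paper's criterion.
from typing import List, Sequence


def filter_by_reward(examples: Sequence, rewards: Sequence[float],
                     keep_fraction: float = 0.1) -> List:
    """Return the keep_fraction of examples with the lowest pre-computed R_Q."""
    ranked = sorted(zip(rewards, range(len(examples))))  # sort by R_Q ascending
    k = max(1, int(len(examples) * keep_fraction))
    return [examples[idx] for _, idx in ranked[:k]]
```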