Progressively Label Enhancement for Large Language Model Alignment

Biao Liu, Ning Xu, Xin Geng

TL;DR

Progressively Label Enhancement (PLE) is an LLM alignment framework that dynamically adjusts the model's training process based on the evolving quality of the generated data. Experimental results show that PLE is effective compared to existing LLM alignment methods.

Abstract

Large Language Model (LLM) alignment aims to prevent models from producing content that is misaligned with human expectations, which can raise ethical and legal concerns. In the last few years, Reinforcement Learning from Human Feedback (RLHF) has been the most prominent method for achieving alignment. Because the RLHF stages face challenges in stability and scalability, which arise from the complex interactions between multiple models, researchers are exploring alternative methods to achieve effects comparable to those of RLHF. However, these methods often rely on large high-quality datasets. Although some methods generate additional data to expand datasets, they often treat model training and data generation as separate, static processes, overlooking the fact that these processes are highly interdependent; this leads to inefficient utilization of the generated data. To address this problem, we propose PLE, i.e., Progressively Label Enhancement for LLM Alignment, a framework that dynamically adjusts the model's training process based on the evolving quality of the generated data. Specifically, we prompt the model to generate responses for both the original query and the query guided by a set of carefully designed principles, and then use a dynamic threshold to determine the appropriate training approach for both responses based on their corresponding reward scores. Experimental results demonstrate the effectiveness of PLE compared to existing LLM alignment methods.
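
The thresholding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `route_by_reward_gap`, the bucket names, and the toy reward function are all hypothetical, and the sketch assumes the comparison is made on the gap between the reward of the principle-guided response and the reward of the original response. How the threshold evolves during training, and which training objective each bucket receives, are not specified in this summary, so the threshold is simply passed in by the caller.

```python
from typing import Callable, Dict, List, Tuple

# (query, principle-guided response, original response)
Triple = Tuple[str, str, str]


def route_by_reward_gap(
    triples: List[Triple],
    reward_fn: Callable[[str, str], float],
    threshold: float,
) -> Dict[str, List[Triple]]:
    """Split response pairs by whether the reward gap exceeds a threshold.

    Assumption: the gap is reward(guided) - reward(original). The caller is
    expected to update `threshold` between training rounds; the update rule
    is not part of this sketch.
    """
    routed: Dict[str, List[Triple]] = {
        "above_threshold": [],
        "at_or_below_threshold": [],
    }
    for query, guided, original in triples:
        gap = reward_fn(query, guided) - reward_fn(query, original)
        key = "above_threshold" if gap > threshold else "at_or_below_threshold"
        routed[key].append((query, guided, original))
    return routed


def toy_reward(query: str, response: str) -> float:
    """Hypothetical stand-in for a reward model: penalise one keyword."""
    return 0.0 if "embezzle" in response else 1.0


if __name__ == "__main__":
    data = [
        ("q1", "I cannot help with that.", "Here is how to embezzle funds."),
        ("q2", "Wear a neat suit.", "Wear a neat suit."),
    ]
    result = route_by_reward_gap(data, toy_reward, threshold=0.5)
    print({name: len(items) for name, items in result.items()})
```

In this toy run the first pair has a large reward gap and the second has none, mirroring the two cases shown in Figure 1.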

Paper Structure (15 sections, 2 theorems, 18 equations, 4 figures, 2 tables, 1 algorithm)

Key Result

Lemma 5.2

For a given language model $\pi$, there exists a pure $L(e, \pi, R)$-level set. For query $\bm x \in \mathcal{D}_{\text{query}}$, if $\pi(\bm y^\text{prompt}|\bm x) - \pi(\bm y|\bm x) > e$, we add the instance-responses pair into the preference dataset $\mathcal{D}_{\text{train}}$ for calculating ra

Figures (4)

  • Figure 1: Comparison of language model responses with and without principle guidance. (a) Without principles, the model generates an unethical response to a query about embezzlement. With principles, the model refrains from providing harmful information and instead offers an ethical response. (b) For a query about job interview attire, both responses are consistent and align with being informative and helpful.
  • Figure 2: Win rates of the model responses vs other baselines evaluated by Claude Sonnet API and human annotators. Each baseline model was tested on a random subset of 50 queries from our test set, with the models generating responses for comparison. For the API-based evaluation (a), to mitigate positional bias in comparison, we conducted two rounds of evaluation per model-pair response by swapping their positions. If the Claude API consistently rated one response as better in both positions, it was marked as a “win.” If it rated one better only once, it was classified as a “tie.” Otherwise, the result was deemed a “lose.” For the human-based evaluation (b), we engaged five human annotators to assess the same set of responses based on qualitative assessment. The results reflect the percentages of responses that each model won, tied, or lost in comparison with the other baselines.
  • Figure 3: Reward curve of principle-guided responses and original responses on the HH dataset.
  • Figure 4: Model's responses to ethical and productivity-related queries. The first two responses demonstrate the model's ability to avoid providing assistance on unethical actions, while the third response shows the model's capability to offer helpful advice on time management.
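
The position-swapped comparison described in the Figure 2 caption can be sketched as follows. The judge interface is an assumption made for illustration: `judge(query, first, second)` is a hypothetical callable that returns `"A"` if it prefers the response shown first and `"B"` otherwise. It stands in for the Claude API call, whose actual prompt and response format are not given here.

```python
from collections import Counter
from typing import Callable, Dict, Iterable

# Hypothetical judge signature: (query, first_response, second_response) -> "A" or "B"
Judge = Callable[[str, str, str], str]


def pairwise_outcome(judge: Judge, query: str, ours: str, baseline: str) -> str:
    """Judge one pair twice with positions swapped and map to win/tie/lose."""
    first_round = judge(query, ours, baseline)   # our response shown first
    second_round = judge(query, baseline, ours)  # our response shown second
    # Count the rounds in which our response was preferred.
    times_preferred = int(first_round == "A") + int(second_round == "B")
    return {2: "win", 1: "tie", 0: "lose"}[times_preferred]


def outcome_percentages(outcomes: Iterable[str]) -> Dict[str, float]:
    """Convert a list of outcomes into win/tie/lose percentages."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        return {"win": 0.0, "tie": 0.0, "lose": 0.0}
    return {k: 100.0 * counts[k] / total for k in ("win", "tie", "lose")}


def always_first(query: str, first: str, second: str) -> str:
    """Toy judge with pure positional bias: always prefers the first slot."""
    return "A"


def prefers_longer(query: str, first: str, second: str) -> str:
    """Toy judge that prefers the longer response regardless of position."""
    return "A" if len(first) >= len(second) else "B"


if __name__ == "__main__":
    print(pairwise_outcome(always_first, "q", "ours", "baseline"))
    print(pairwise_outcome(prefers_longer, "q", "a longer answer", "short"))
```

A judge that always favours the first slot is preferred once per pair and therefore yields a tie under this rule, which is how swapping positions mitigates positional bias.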

Theorems & Definitions (3)

  • Definition 5.1
  • Lemma 5.2
  • Theorem 5.3