Extensive Self-Contrast Enables Feedback-Free Language Model Alignment

Xiao Liu, Xixuan Song, Yuxiao Dong, Jie Tang

TL;DR

This work tackles the cost of RLHF by proposing Self-Contrast, a feedback-free LLM alignment method that exploits large volumes of self-generated negatives filtered by embeddings. The approach uses SFT targets, self-generated candidates, and Direct Preference Optimization to align models without explicit preference data, supported by a theoretical result showing that many negatives can approximate balanced annotations. Empirically, Self-Contrast outperforms SFT and standard DPO across Nectar, UltraChat, and HH-RLHF benchmarks, with performance steadily improving as the number of negatives increases. The work demonstrates practical data-efficiency gains and provides code to enable scalable, cost-effective LLM alignment.

Abstract

Reinforcement learning from human feedback (RLHF) has been a central technique for recent large language model (LLM) alignment. However, its heavy dependence on costly human or LLM-as-Judge preference feedback could stymie its wider application. In this work, we introduce Self-Contrast, a feedback-free LLM alignment method that exploits extensive self-generated negatives. With only supervised fine-tuning (SFT) targets, Self-Contrast uses the LLM itself to generate a large pool of diverse candidates, and uses a pre-trained embedding model to filter multiple negatives according to text similarity. Theoretically, we illustrate that in this setting, merely scaling negative responses can still effectively approximate situations with more balanced positive and negative preference annotations. Our experiments with direct preference optimization (DPO) on three datasets show that Self-Contrast consistently outperforms SFT and standard DPO training by large margins. Moreover, as the number of self-generated negatives increases, the performance of Self-Contrast continues to grow. Code and data are available at https://github.com/THUDM/Self-Contrast.
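The embedding-based filtering step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that negatives are the `a_percent` share of self-generated candidates least similar (by cosine similarity) to the SFT target, and the function names `cosine` and `select_negatives`, as well as the toy two-dimensional embeddings, are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_negatives(target_emb, candidate_embs, a_percent):
    """Return indices of the a% of candidates least similar to the SFT target.

    Assumption for illustration: low similarity to the target marks a
    candidate as a likely negative.
    """
    sims = [cosine(target_emb, c) for c in candidate_embs]
    order = sorted(range(len(sims)), key=lambda i: sims[i])  # ascending similarity
    k = max(1, math.ceil(len(order) * a_percent / 100))
    return order[:k]

# Toy embeddings standing in for the output of a pre-trained embedding model.
target = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
negatives = select_negatives(target, candidates, a_percent=50)
```

In practice the vectors would come from a real embedding model applied to the SFT target and each sampled response; the selection logic stays the same.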

Paper Structure (23 sections, 18 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: Feedback-free Self-Contrast achieves higher MT-Bench scores than previous approaches while using only SFT prompts, targets, and self-generated negative samples, without iterative training.
  • Figure 2: Comparison of scaling Self-Contrast against scaling standard preference pairs. Feedback-free Self-Contrast is competitive with training on $3\times$ more preference feedback labels.
  • Figure 3: Self-Contrast is a self-alignment process involving three primary stages: 1. Generating mixed-quality responses by $\theta_{SFT}$. 2. Filtering out qualified negative responses using an embedding model. 3. Training $\theta_{SFT}$ with multiple negative samples against one positive sample through DPO.
  • Figure 4: Impact of the number of negative samples on MT-Bench for Nectar, UltraChat, and HH-RLHF$_{test}$. As the number of negative samples increases, the performance of Self-Contrast exceeds that of $\text{DPO}_{std}$ trained on the standard preference dataset. Moreover, increasing the number of negative samples achieves the same effect as adding preference data, without increasing the number of positive samples.
  • Figure 5: The impact of parameter $a\%$ on data accuracy and performance. While a smaller $a$ leads to consistently higher accuracy of the selected negatives, it harms LLM alignment when $a$ becomes too small.
  • ...and 2 more figures
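
The third stage in Figure 3, training with multiple negatives against one positive through DPO, can be sketched as follows. This is a minimal illustration rather than the paper's training code: it uses the standard DPO pairwise loss and assumes each negative is paired with the same positive and the pair losses are averaged. The function names, the value `beta=0.1`, and the log-probability numbers are hypothetical.

```python
import math

def dpo_pair_loss(pol_pos, ref_pos, pol_neg, ref_neg, beta=0.1):
    """Standard DPO loss for one (positive, negative) pair.

    Inputs are sequence log-probabilities under the policy (pol_*) and the
    frozen reference model (ref_*).
    """
    margin = beta * ((pol_pos - ref_pos) - (pol_neg - ref_neg))
    # -log(sigmoid(margin)), computed stably as softplus(-margin)
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def multi_negative_loss(pol_pos, ref_pos, negs, beta=0.1):
    """Average the pair loss over several negatives sharing one positive.

    negs is a list of (policy_logp, reference_logp) tuples, one per negative.
    Averaging is an assumption made for this sketch.
    """
    losses = [dpo_pair_loss(pol_pos, ref_pos, pn, rn, beta) for pn, rn in negs]
    return sum(losses) / len(losses)

# One SFT target (positive) contrasted with two self-generated negatives.
loss = multi_negative_loss(-1.0, -1.0, [(-5.0, -2.0), (-4.0, -2.0)])
```

When the policy equals the reference model, every margin is zero and the loss is $\log 2$; it falls below that as the policy lowers the likelihood of the negatives relative to the reference.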

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Remark 1