Dynamic Noise Preference Optimization for LLM Self-Improvement via Synthetic Data

Haoyan Yang, Ting Hua, Shangqian Gao, Binfeng Xu, Zheng Tang, Jie Xu, Hongxia Jin, Vijay Srinivasan

TL;DR

DNPO tackles the bottleneck of data-hungry LLM scaling by enabling self-improvement on synthetic data without heavy human labeling. It combines Dynamic Sample Labeling, which adaptively forms high-quality preference pairs, with Noise Preference Optimization, which injects trainable noise into the optimization, yielding a bi-level (min-max) update that sustains progress across iterations. On Zephyr-7B-SFT with UltraChat data, DNPO achieves consistent gains across benchmarks, including notable improvements on TruthfulQA and ARC and a 29.4% win-loss rate gap against SPIN in GPT-4o-mini evaluations. These results suggest that carefully managed synthetic data and adaptive noise can sustain improvement across iterations and reduce dependence on human-annotated data.

Abstract

Although LLMs have achieved significant success, their reliance on large volumes of human-annotated data has limited their potential for further scaling. In this situation, utilizing self-generated synthetic data has become crucial for fine-tuning LLMs without extensive human annotation. However, current methods often fail to ensure consistent improvements across iterations, with performance stagnating after only minimal updates. To overcome these challenges, we introduce Dynamic Noise Preference Optimization (DNPO). DNPO employs a dynamic sample labeling mechanism to construct preference pairs for training and introduces controlled, trainable noise into the preference optimization process. Our approach effectively prevents stagnation and enables continuous improvement. In experiments with Zephyr-7B, DNPO consistently outperforms existing methods, showing an average performance boost of 2.6% across multiple benchmarks. Additionally, DNPO shows a significant improvement in model-generated data quality, with a 29.4% win-loss rate gap compared to the baseline in GPT-4 evaluations. This highlights its effectiveness in enhancing model performance through iterative refinement.
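The dynamic sample labeling step mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the names `PreferencePair`, `label_pairs`, and the `score` callable are hypothetical, the evaluator is stood in for by any function that returns a number, and the tie-breaking rule (keep the human-annotated response as preferred) is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response treated as the positive sample
    rejected: str  # response treated as the negative sample


def label_pairs(
    prompts: Sequence[str],
    references: Sequence[str],
    generations: Sequence[str],
    score: Callable[[str, str], float],
) -> List[PreferencePair]:
    """Build preference pairs by comparing evaluator scores.

    `score(prompt, response)` is a hypothetical stand-in for the
    evaluation model. Unlike a fixed labeling that always prefers the
    reference, the higher-scoring response becomes `chosen`.
    """
    pairs = []
    for prompt, ref, gen in zip(prompts, references, generations):
        # Assumption: ties keep the human-annotated reference as chosen.
        if score(prompt, gen) > score(prompt, ref):
            chosen, rejected = gen, ref
        else:
            chosen, rejected = ref, gen
        pairs.append(PreferencePair(prompt, chosen, rejected))
    return pairs
```

In practice `score` would wrap a call to an evaluation model; here any callable with that shape works, which keeps the labeling logic separate from the choice of evaluator.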

Paper Structure

This paper contains 22 sections, 9 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Win rate comparison of generated data versus human-annotated data, based on GPT-4o-mini's evaluation. A win indicates that generated data scored higher than human-annotated data.
  • Figure 2: This figure illustrates the log probability distributions of positive samples, negative samples in iteration $k$, and the generated data from the iteration $k+1$ model during SPIN training. The minimal differences between the generated data of iteration $k+1$ and the previous iteration $k$ indicate model stagnation during training.
  • Figure 3: This diagram illustrates the iterative training process of DNPO. There are two core components: Dynamic Sample Labeling (DSL) and Noise Preference Optimization (NPO). In each iteration $k$, DSL is responsible for generating new data from the model and labeling it by comparing it with SFT ground truth data using an evaluation model, forming preference pairs. These pairs are then passed to the NPO, which computes a probability ratio between the SFT ground truth and the generated data. NPO applies a noise-tuning strategy, where the model is frozen and the noise component is trained to minimize the margin between positive and negative sample pairs. In the following step, the noise is frozen while optimizing the model to maximize this margin. This leads to an updated model for the next iteration $k+1$.
  • Figure 4: Comparison between a human-annotated response from UltraChat-200k and a model-generated answer from Zephyr-7B after a single SPIN iteration. The ground truth misinterprets the user's intent and declines to write a clothing review, whereas Zephyr-7B generates a detailed and descriptive review of a recently purchased blouse, highlighting aspects such as fit, fabric quality, color, and style.
  • Figure 5: Comparison of average benchmark scores across iterations for DNPO and SPIN. DNPO consistently improves over iterations while SPIN stagnates after the first iteration.
  • ...and 8 more figures
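
The alternating update described in the Figure 3 caption (train the noise with the model frozen to shrink the margin, then train the model with the noise frozen to enlarge it) can be illustrated with a toy scalar example. Everything here is an assumption made for illustration: the margin form `theta * d + sigma * eps`, the logistic loss, the learning rate, and the clipping bound `max_sigma` are not taken from the paper, and a real setup would operate on log-probability ratios of an LLM rather than scalars.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def margin(theta: float, sigma: float, d: float, eps: float) -> float:
    # Toy margin between a positive and a negative sample:
    # a model term (theta * d) plus a trainable-noise term (sigma * eps).
    return theta * d + sigma * eps


def noise_step(theta, sigma, d, eps, lr=0.1, max_sigma=1.0):
    """Model frozen: move sigma to *increase* the loss -log(sigmoid(margin)),
    which lowers the margin. Clipping keeps the noise bounded (assumption)."""
    grad_sigma = (sigmoid(margin(theta, sigma, d, eps)) - 1.0) * eps
    sigma = sigma + lr * grad_sigma  # gradient ascent on the loss
    return max(-max_sigma, min(max_sigma, sigma))


def model_step(theta, sigma, d, eps, lr=0.1):
    """Noise frozen: move theta to *decrease* the loss, raising the margin."""
    grad_theta = (sigmoid(margin(theta, sigma, d, eps)) - 1.0) * d
    return theta - lr * grad_theta  # gradient descent on the loss


theta, sigma = 0.0, 0.0
d, eps = 1.0, 0.5  # fixed toy feature gap and noise sample
m0 = margin(theta, sigma, d, eps)
sigma = noise_step(theta, sigma, d, eps)
m1 = margin(theta, sigma, d, eps)  # lower than m0: noise made the pair harder
theta = model_step(theta, sigma, d, eps)
m2 = margin(theta, sigma, d, eps)  # higher than m1: model recovered the margin
```

The point of the sketch is only the ordering of the two steps and their opposite signs; the noise acts as an adversary that keeps the model's objective from becoming trivially satisfied.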