Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning

Derin Cayir, Renjie Tao, Rashi Rungta, Kai Sun, Sean Chen, Haidar Khan, Minseok Kim, Julia Reinspach, Yue Liu

TL;DR

This work tackles the data bottleneck in preference-based fine-tuning of LLMs by introducing Refine-n-Judge, a fully automated loop where a single LLM both refines outputs and judges improvements. The method generates sequences of increasingly high-quality, preference-labeled responses without human annotations or a separate reward model, enabling scalable dataset curation. Across five corpora and with Llama 3 models, Refine-n-Judge achieved strong judge-based preference gains (over 74% wins) and yielded notable fine-tuning improvements on AlpacaEval, AlpacaEval 2.0, and MT-Bench (+5%, +5%, +19%, respectively). The approach demonstrates robustness to noisy data and establishes a scalable, human-free pathway for producing high-quality preference datasets to enhance LLM alignment and capabilities, while acknowledging limitations in judge consistency and ethical considerations.

Abstract

Large Language Models (LLMs) have demonstrated remarkable progress through preference-based fine-tuning, which critically depends on the quality of the underlying training data. While human feedback is essential for improving data quality, it is costly and does not scale well. In this paper, we introduce Refine-n-Judge, an automated iterative approach that leverages a single LLM as both a refiner and a judge to enhance dataset quality. Unlike existing iterative refinement methods, Refine-n-Judge employs an LLM to both generate refinements and explicitly evaluate each improvement, ensuring that every iteration meaningfully enhances the dataset without requiring additional human annotation or a separate reward model. At each step, the LLM refines a response and judges whether the refinement is an improvement over the previous answer. This process continues until the LLM prefers the previous answer over its refinement, indicating that no further improvement was found. The result is a sequence of preference-labeled responses of increasing quality, ideal for fine-tuning. We demonstrate the effectiveness of Refine-n-Judge across a range of public datasets spanning five corpora, targeting tasks such as coding, math, and conversation. Models (Llama 3.1-8B and Llama 3.3-70B) fine-tuned on Refine-n-Judge-enhanced datasets were preferred by LLM judges in over 74% of comparisons against models tuned on the original dataset by GPT-4. Additionally, we report performance gains: +5% on AlpacaEval and AlpacaEval 2.0, and +19% on MT-Bench. Our results indicate that Refine-n-Judge produces high-quality datasets and scalable model improvements.
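
The loop described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `refine_n_judge`, `adjacent_preference_pairs`, `toy_refine`, and `toy_judge` are hypothetical, the `max_iters` cap is a safety guard that the abstract does not mention, and pairing only adjacent answers in the chain is one possible way to turn a chain into preference data. In practice the `refine` and `judge` callables would wrap calls to the same LLM; here they are toy stand-ins so the sketch runs on its own.

```python
from typing import Callable, List, Tuple

def refine_n_judge(
    query: str,
    initial_answer: str,
    refine: Callable[[str, str], str],       # (query, answer) -> refined answer
    judge: Callable[[str, str, str], bool],  # (query, previous, candidate) -> True if candidate wins
    max_iters: int = 10,                     # assumption: safety cap, not stated in the source
) -> List[str]:
    """Return the chain of accepted answers, from the initial one to the last accepted one."""
    chain = [initial_answer]
    for _ in range(max_iters):
        candidate = refine(query, chain[-1])
        # Stop as soon as the judge prefers the previous answer over the refinement.
        if not judge(query, chain[-1], candidate):
            break
        chain.append(candidate)
    return chain

def adjacent_preference_pairs(chain: List[str]) -> List[Tuple[str, str]]:
    """Build (chosen, rejected) pairs from consecutive answers (an assumed pairing scheme)."""
    return [(chain[i + 1], chain[i]) for i in range(len(chain) - 1)]

# Toy stand-ins for the LLM calls (hypothetical, for illustration only).
def toy_refine(query: str, answer: str) -> str:
    return answer + "!"

def toy_judge(query: str, previous: str, candidate: str) -> bool:
    # Pretend refinements stop helping once the answer exceeds five characters.
    return len(candidate) <= 5

chain = refine_n_judge("q", "ab", toy_refine, toy_judge)
pairs = adjacent_preference_pairs(chain)
```

With the toy stand-ins, the chain grows until the judge rejects a candidate, and each accepted refinement yields one (chosen, rejected) pair against its predecessor, since those are the only comparisons the judge actually made.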

Paper Structure

This paper contains 25 sections, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The Refine-and-Judge process. Starting with a query and an initial answer, the LLM refines the answer based on generated feedback. It then compares the refined answer with the initial one and selects the preferred version. If the refined answer is preferred, it becomes the new initial answer; otherwise, the process ends.
  • Figure 2: Example of Refine-n-Judge. Beginning with an initial model-generated output, each successive response is produced by prompting the model to refine the previous one, and the final selected response is highlighted in gray.
  • Figure 3: Win % of the Refine-n-Judge pipeline over varying numbers of refinement iterations. The win rate reflects how often the Refine-n-Judge pipeline is preferred over a pipeline with only a refiner, as evaluated by GPT-4. Each experiment was repeated three times.
  • Figure 4: Illustrative example of iterative refinement pipeline without a stopping condition.
  • Figure 5: Pairwise win rate comparison across iterations for two pipelines: (a) LLM-Refine only, proposed in Self-Refine (Madaan et al., 2023), and (b) Refine-n-Judge. For the Refine-n-Judge pipeline, the matrix spans up to $Ans_5$ due to the decreasing number of instances where the refinement count exceeds 5.
  • ...and 2 more figures