Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis

Shuhaib Mehri, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür

TL;DR

This paper introduces Reference-Level Feedback (RLF), a data-synthesis paradigm that extracts desirable traits from high-quality reference samples and propagates them to newly generated instruction-response pairs, with the aim of overcoming the quality ceiling typical of generator-produced data. Applying this approach to create REFED, a dataset of 10K instruction-response pairs, the authors show that fine-tuning Llama-3.1-8B-Instruct and Mistral-7B-Instruct on REFED yields state-of-the-art performance among similarly sized models on AlpacaEval 2.0 and Arena-Hard, including a 43.96% length-controlled win rate on AlpacaEval 2.0. Through extensive experiments, they show that Reference-Level Feedback consistently outperforms traditional sample-level feedback, generalizes across architectures, produces diverse data, and remains cost-efficient (synthesizing 10K samples for under $20 with GPT-4o mini). The work also analyzes data-filtering strategies and scalability, arguing that signals drawn from high-quality seed data can meaningfully improve data synthesis while remaining practical for large-scale deployment.

Abstract

High-quality instruction-tuning data is crucial for developing Large Language Models (LLMs) that can effectively navigate real-world tasks and follow human instructions. While synthetic data generation offers a scalable approach for creating such datasets, it imposes a quality ceiling where models trained on the data cannot outperform the LLM generating it. To overcome this limitation, we introduce Reference-Level Feedback, a paradigm that extracts desirable characteristics from carefully curated reference samples to guide the synthesis of higher-quality instruction-response pairs. Using this approach, we synthesize REFED, a dataset of 10K instruction-response pairs. Fine-tuning Llama-3.1-8B-Instruct and Mistral-7B-Instruct on REFED yields state-of-the-art performance among similarly sized models, notably reaching a 43.96% length-controlled win rate on AlpacaEval 2.0. Extensive experiments demonstrate that Reference-Level Feedback consistently outperforms traditional sample-level feedback methods, generalizes across model architectures, and produces high-quality and diverse data at low cost.

Paper Structure

This paper contains 41 sections, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Comparison of feedback approaches for data synthesis. Left: Traditional sample-level feedback generates and applies feedback individually for each sample. Right: Our Reference-Level Feedback approach collects feedback once from a high-quality reference sample and applies it to multiple new samples.
  • Figure 2: An overview of our data synthesis pipeline. Starting from our seed data, we select a reference sample and collect Reference-Level Feedback on both the instruction and the response. Instruction feedback is used to synthesize new instructions. We then generate corresponding responses and improve them using the response feedback.
  • Figure 3: Length-controlled win rate on AlpacaEval 2.0 for Llama-3.1-8B-Instruct fine-tuned on various subsets of REFED, selected with different filtering strategies.
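
The pipeline described in the captions for Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Sample` class, the function names, and every prompt string are hypothetical, and `llm` stands for any text-in, text-out callable (for example, a wrapper around a chat-completion API). The point it tries to show is that feedback is collected once per reference sample and then reused for every newly synthesized pair.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Sample:
    instruction: str
    response: str


def collect_reference_feedback(llm: LLM, reference: Sample) -> Tuple[str, str]:
    """Ask the model what is desirable about the reference (hypothetical prompts)."""
    instruction_feedback = llm(
        "Describe the desirable qualities of this instruction:\n"
        f"{reference.instruction}"
    )
    response_feedback = llm(
        "Describe the desirable qualities of this response:\n"
        f"Instruction: {reference.instruction}\nResponse: {reference.response}"
    )
    return instruction_feedback, response_feedback


def synthesize_from_reference(llm: LLM, reference: Sample, n_new: int) -> List[Sample]:
    """Generate n_new pairs guided by feedback collected once from one reference."""
    # Feedback is gathered a single time, outside the loop.
    instruction_feedback, response_feedback = collect_reference_feedback(llm, reference)

    synthesized = []
    for i in range(n_new):
        # Step 1: instruction feedback guides a new instruction.
        new_instruction = llm(
            f"Write new instruction #{i + 1} that exhibits these qualities:\n"
            f"{instruction_feedback}"
        )
        # Step 2: draft a response to the new instruction.
        draft = llm(f"Respond to the following instruction:\n{new_instruction}")
        # Step 3: improve the draft using the reference's response feedback.
        improved = llm(
            "Improve the response so that it exhibits these qualities:\n"
            f"{response_feedback}\n"
            f"Instruction: {new_instruction}\nResponse: {draft}"
        )
        synthesized.append(Sample(new_instruction, improved))
    return synthesized
```

Under these assumptions, one reference sample costs two feedback calls plus three calls per new sample, whereas a sample-level scheme would need a separate feedback call for every generated sample.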