Beyond Sample-Level Feedback: Using Reference-Level Feedback to Guide Data Synthesis
Shuhaib Mehri, Xiusi Chen, Heng Ji, Dilek Hakkani-Tür
TL;DR
This paper introduces Reference-Level Feedback (RLF), a data-synthesis paradigm that extracts desirable characteristics from high-quality reference samples and uses them to guide the generation of new instruction-response pairs, with the aim of overcoming the quality ceiling that limits models trained on generator-produced data. The authors use this approach to create REFED, a dataset of 10K instruction-response pairs. Fine-tuning Llama-3.1-8B-Instruct and Mistral-7B-Instruct on REFED yields state-of-the-art performance among similarly sized models on AlpacaEval 2.0 and Arena-Hard, including a 43.96% length-controlled win rate on AlpacaEval 2.0. Their experiments show that Reference-Level Feedback consistently outperforms traditional sample-level feedback, generalizes across model architectures, produces more diverse data, and remains cost-efficient (synthesizing 10K samples costs under $20 with GPT-4o mini). The work also analyzes data-filtering strategies and scalability, arguing that quality signals drawn from the seed references can meaningfully improve data synthesis while remaining practical at large scale.
Abstract
High-quality instruction-tuning data is crucial for developing Large Language Models (LLMs) that can effectively navigate real-world tasks and follow human instructions. While synthetic data generation offers a scalable approach for creating such datasets, it imposes a quality ceiling: models trained on the data cannot outperform the LLM that generated it. To overcome this limitation, we introduce Reference-Level Feedback, a paradigm that extracts desirable characteristics from carefully curated reference samples to guide the synthesis of higher-quality instruction-response pairs. Using this approach, we synthesize REFED, a dataset of 10K instruction-response pairs. Fine-tuning Llama-3.1-8B-Instruct and Mistral-7B-Instruct on REFED yields state-of-the-art performance among similarly sized models, notably reaching a 43.96% length-controlled win rate on AlpacaEval 2.0. Extensive experiments demonstrate that Reference-Level Feedback consistently outperforms traditional sample-level feedback methods, generalizes across model architectures, and produces high-quality and diverse data at low cost.
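As a rough illustration of the idea in the abstract, the sketch below collects feedback once per reference sample and reuses it to guide every pair synthesized from that reference. It is a minimal sketch under stated assumptions, not the authors' pipeline: the `Pair` class, the function names, the prompt wording, and the `llm` callable (any function mapping a prompt string to a completion string) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class Pair:
    instruction: str
    response: str


def extract_reference_feedback(llm: LLM, ref: Pair) -> str:
    """Ask the model which characteristics make the reference sample good.

    The prompt wording is an assumption made for illustration only.
    """
    prompt = (
        "Describe the desirable characteristics of this sample.\n"
        f"Instruction: {ref.instruction}\n"
        f"Response: {ref.response}"
    )
    return llm(prompt)


def synthesize_pair(llm: LLM, ref: Pair, feedback: str) -> Pair:
    """Generate one new pair, guided by feedback taken from the reference."""
    instruction = llm(
        "Write a new instruction with these characteristics.\n"
        f"Characteristics: {feedback}\n"
        f"Reference instruction: {ref.instruction}"
    )
    response = llm(
        "Answer the instruction, following these characteristics.\n"
        f"Characteristics: {feedback}\n"
        f"Instruction: {instruction}"
    )
    return Pair(instruction.strip(), response.strip())


def synthesize_dataset(
    llm: LLM, references: List[Pair], per_reference: int = 1
) -> List[Pair]:
    """Collect feedback once per reference, then reuse it for each new pair."""
    data: List[Pair] = []
    for ref in references:
        feedback = extract_reference_feedback(llm, ref)  # one call per reference
        for _ in range(per_reference):
            data.append(synthesize_pair(llm, ref, feedback))
    return data
```

In this sketch the feedback call is made once per reference rather than once per generated sample, which is one plausible reading of how reference-level feedback differs from sample-level feedback; the paper itself should be consulted for the actual procedure.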
