R.I.P.: Better Models by Survival of the Fittest Prompts

Ping Yu, Weizhe Yuan, Olga Golovneva, Tianhao Wu, Sainbayar Sukhbaatar, Jason Weston, Jing Xu

TL;DR

Data quality is a key driver of instruction-following performance in LLMs. The authors introduce RIP, a data-filtering method that uses rejected-response quality and the reward gap between chosen and rejected responses to curate prompts, and Self-RIP to generate high-quality synthetic prompts. Across human-written and synthetic data, applied to Llama 3.1-8B-Instruct and Llama 3.3-70B-Instruct with DPO, RIP consistently outperforms baseline filtering methods on AlpacaEval2, Arena-Hard, and WildBench, with Self-RIP further improving results. The approach demonstrates strong generalization, reduces noisy prompts, and suggests potential safety and scalability benefits for future RLHF workflows.

Abstract

Training data quality is one of the most important drivers of final model quality. In this work, we introduce a method for evaluating data integrity based on the assumption that low-quality input prompts result in high-variance, low-quality responses. This is achieved by measuring the rejected response quality and the reward gap between the chosen and rejected responses of a preference pair. Our method, Rejecting Instruction Preferences (RIP), can be used to filter prompts from existing training sets or to build high-quality synthetic datasets, yielding large performance gains across various benchmarks compared to unfiltered data. Using Llama 3.1-8B-Instruct, RIP improves AlpacaEval2 LC Win Rate by 9.4%, Arena-Hard by 8.7%, and WildBench by 9.9%. Using Llama 3.3-70B-Instruct, RIP improves Arena-Hard from 67.5 to 82.9, moving it from 18th to 6th place overall on the leaderboard.
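
The filtering idea described in the abstract can be illustrated with a minimal sketch. It assumes each prompt already comes with a chosen and a rejected response scored by some reward model, and it keeps a prompt only when the rejected response is good enough, long enough (the length criterion is taken from the Figure 5 caption), and not too far below the chosen response. The `PreferencePair` type, the `rip_filter` function, and all threshold parameters are hypothetical names introduced here for illustration; the actual thresholds and reward model used by the authors are not given in this summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferencePair:
    """One prompt with the scores of its chosen/rejected responses (hypothetical schema)."""
    prompt: str
    chosen_reward: float     # reward-model score of the chosen response
    rejected_reward: float   # reward-model score of the rejected response
    rejected_length: int     # length of the rejected response (units are an assumption)


def rip_filter(
    pairs: List[PreferencePair],
    min_rejected_reward: float,
    min_rejected_length: int,
    max_reward_gap: float,
) -> List[PreferencePair]:
    """Keep prompts whose rejected response is not too weak or too short,
    and whose chosen-vs-rejected reward gap is not too large.

    All three thresholds are illustrative assumptions, not values from the paper.
    """
    kept = []
    for pair in pairs:
        gap = pair.chosen_reward - pair.rejected_reward
        if (
            pair.rejected_reward >= min_rejected_reward
            and pair.rejected_length >= min_rejected_length
            and gap <= max_reward_gap
        ):
            kept.append(pair)
    return kept


if __name__ == "__main__":
    # Toy scores, invented purely to exercise the filter.
    data = [
        PreferencePair("clear prompt", 3.0, 2.5, 120),
        PreferencePair("noisy prompt", 3.0, -4.0, 90),   # weak rejected response, large gap
        PreferencePair("terse prompt", 1.0, 0.8, 5),     # rejected response too short
    ]
    survivors = rip_filter(data, min_rejected_reward=0.0,
                           min_rejected_length=50, max_reward_gap=2.0)
    print([p.prompt for p in survivors])
```

In practice the thresholds would presumably be tuned on a validation set rather than fixed by hand; the sketch only shows how the signals combine into a per-prompt keep/drop decision.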

Paper Structure

This paper contains 42 sections, 5 equations, 6 figures, 27 tables.

Figures (6)

  • Figure 1: Our method Rejecting Instruction Preferences (RIP) for curating data, and Self-RIP for creating synthetic data. The x-axis represents the effective training set size (after filtering). At every data size training on unfiltered WildChat prompts is significantly outperformed by RIP. RIP also outperforms various other curation baselines. Synthetic data built by Self-RIP improves results further.
  • Figure 2: Results on DPO Training with Varying WildChat Data Sizes. When different amounts of WildChat data are used for DPO training on Llama 3.1-8B-Instruct, performance, measured by Armo rewards on the validation set, gradually saturates as the data size increases.
  • Figure 3: GPT4 eval prompt.
  • Figure 4: Self-Instruct few-shot prompt template.
  • Figure 5: t-SNE plots on instructions before and after filtering by rewards and lengths of rejected responses. Red dots represent unfiltered instructions, while blue dots are instructions curated by filtering out those with low-reward and shorter rejected responses.
  • ...and 1 more figure