Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning

Ming Li, Lichang Chen, Jiuhai Chen, Shwai He, Heng Huang, Jiuxiang Gu, Tianyi Zhou

TL;DR

The paper tackles the data quality bottleneck in LLM instruction-tuning by introducing reflection-tuning, a data-refinement framework that uses an oracle LLM to introspect and improve both instructions and responses. Through two phases—instruction reflection and response reflection—the method generates recycled, higher-quality instruction–response pairs without requiring extra model training. Empirical results show that models trained on recycled data outperform baselines on AlpacaEval, Vicuna test sets, and Open LLM Leaderboard benchmarks, often surpassing larger open-source counterparts. The approach demonstrates data-efficient gains and broad applicability, suggesting a practical path to more reliable instruction-following LLMs.

Abstract

Recent advancements in Large Language Models (LLMs) have expanded the horizons of natural language understanding and generation. Notably, the output control and alignment with the input of LLMs can be refined through instruction tuning. However, as highlighted in several studies, low-quality data in the training set are usually detrimental to instruction tuning, resulting in inconsistent or even misleading LLM outputs. We propose a novel method, termed "reflection-tuning," which addresses the problem by leveraging the self-improvement and judging capabilities of LLMs. This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data. Extensive experiments on widely used evaluation benchmarks show that LLMs trained with our recycled data outperform those trained with existing datasets.
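
The two-phase recycling loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Oracle` callable, the `Pair` type, the function names, and the prompt wording are all hypothetical placeholders, since the summary here does not give the paper's actual prompts or reflection criteria. The sketch also assumes the oracle returns only the rewritten text, and it falls back to the original text when the oracle returns nothing.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical oracle interface: takes a prompt string, returns generated text.
# In practice this would wrap a call to a strong LLM; here it is left abstract.
Oracle = Callable[[str], str]


@dataclass
class Pair:
    """One instruction-response training example."""
    instruction: str
    response: str


def reflect_on_instruction(pair: Pair, oracle: Oracle) -> Pair:
    """Phase 1 (instruction reflection): ask the oracle for a better instruction."""
    prompt = (
        "Rewrite the instruction so that it is clearer and more specific. "
        "Return only the rewritten instruction.\n\n"
        f"Instruction: {pair.instruction}\n"
        f"Response: {pair.response}"
    )
    improved = oracle(prompt).strip()
    # Keep the original instruction if the oracle returns an empty string.
    return Pair(improved or pair.instruction, pair.response)


def reflect_on_response(pair: Pair, oracle: Oracle) -> Pair:
    """Phase 2 (response reflection): ask the oracle for a better response."""
    prompt = (
        "Improve the response so that it fully and accurately answers the "
        "instruction. Return only the improved response.\n\n"
        f"Instruction: {pair.instruction}\n"
        f"Response: {pair.response}"
    )
    improved = oracle(prompt).strip()
    # Keep the original response if the oracle returns an empty string.
    return Pair(pair.instruction, improved or pair.response)


def recycle(dataset: List[Pair], oracle: Oracle) -> List[Pair]:
    """Run both reflection phases, in order, over every pair in the dataset."""
    return [
        reflect_on_response(reflect_on_instruction(pair, oracle), oracle)
        for pair in dataset
    ]
```

Only the oracle is queried during recycling; no model is trained in this loop. The recycled pairs would then be used as the instruction-tuning set in place of the originals.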

Paper Structure

This paper contains 24 sections, 3 equations, 1 figure, 3 tables.

Figures (1)

  • Figure 1: Comparing our recycled models with other renowned models on the Vicuna evaluation set. The models being compared are listed on the left. Each bar represents a comparison between our recycled model and another model. The red parts represent the number of wins and the green parts represent the number of losses. GPT4 is utilized as the judge.