Smaller Language Models Are Better Instruction Evolvers

Tingfeng Hui, Lulu Zhao, Guanting Dong, Yaqi Zhang, Hua Zhou, Sen Su

TL;DR

This work challenges the assumption that bigger language models are always better at evolving instructions, showing that smaller language models (SLMs) can generate more effective and diverse instruction data across the Evol-Instruct, AutoIF, and Auto Evol-Instruct scenarios. It analyzes why SLMs achieve this, highlighting that they maintain a broader output space and avoid overconfident token selection, which enables richer instruction variants. To evaluate instruction data without instruction tuning, the authors introduce Instruction Complex-Aware IFD (IC-IFD), which adds instruction difficulty, measured by instruction perplexity, as a penalty to the original IFD score and yields more accurate assessments of instruction data. The results suggest a practical shift toward using SLMs for scalable instruction data synthesis, reducing compute while improving instruction complexity and variety, with IC-IFD serving as an evaluation tool for future work.

Abstract

Instruction tuning has been widely used to unleash the complete potential of large language models. Notably, complex and diverse instructions are of significant importance as they can effectively align models with various downstream tasks. However, current approaches to constructing large-scale instructions predominantly favour powerful models such as GPT-4 or those with over 70 billion parameters, under the empirical presumption that such larger language models (LLMs) inherently possess enhanced capabilities. In this study, we question this prevalent assumption and conduct an in-depth exploration into the potential of smaller language models (SLMs) in the context of instruction evolution. Extensive experiments across three scenarios of instruction evolution reveal that smaller language models (SLMs) can synthesize more effective instructions than LLMs. Further analysis demonstrates that SLMs possess a broader output space during instruction evolution, resulting in more complex and diverse variants. We also observe that the existing metrics fail to focus on the impact of the instructions. Thus, we propose Instruction Complex-Aware IFD (IC-IFD), which introduces instruction complexity in the original IFD score to evaluate the effectiveness of instruction data more accurately. Our source code is available at: https://github.com/HypherX/Evolution-Analysis
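
The idea of folding instruction complexity into IFD can be illustrated with a small sketch. It is not taken from the authors' repository. It assumes IFD is the ratio of the response perplexity given the instruction to the response perplexity alone, and that the instruction penalty is applied by dividing by the instruction's perplexity; that exact form is an assumption, since this page only states that instruction perplexity acts as a penalty. The function names and the toy log-probabilities are hypothetical, and in practice the log-probabilities would come from a language model.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from natural-log token probabilities: exp(-mean(log p))."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def ifd(resp_given_instr: Sequence[float], resp_alone: Sequence[float]) -> float:
    """Assumed IFD form: PPL(response | instruction) / PPL(response)."""
    return perplexity(resp_given_instr) / perplexity(resp_alone)


def ic_ifd(
    instr: Sequence[float],
    resp_given_instr: Sequence[float],
    resp_alone: Sequence[float],
) -> float:
    """Assumed IC-IFD form: IFD divided by the instruction's perplexity.

    The division is this sketch's reading of "penalty", not a confirmed formula.
    """
    return ifd(resp_given_instr, resp_alone) / perplexity(instr)


# Toy log-probabilities (hypothetical values, not model outputs).
resp_given_instr = [math.log(0.5)] * 4   # response tokens scored with the instruction
resp_alone = [math.log(0.25)] * 4        # same response tokens scored without it
easy_instr = [math.log(0.5)] * 3         # low-perplexity instruction
hard_instr = [math.log(0.1)] * 3         # high-perplexity instruction

base_score = ifd(resp_given_instr, resp_alone)
easy_score = ic_ifd(easy_instr, resp_given_instr, resp_alone)
hard_score = ic_ifd(hard_instr, resp_given_instr, resp_alone)
```

Under these assumptions, two samples with the same IFD receive different IC-IFD scores once instruction perplexity is taken into account, with the higher-perplexity instruction scoring lower.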

Paper Structure

This paper contains 51 sections, 2 equations, 20 figures, 15 tables.

Figures (20)

  • Figure 1: Comparison of performance on Llama-3-8B during three iterations of instruction evolution, using Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct as supervised models for each round under Evol-Instruct scenario.
  • Figure 2: Distribution of difficulty levels for instructions evolved during three iterations, using Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct as supervised models for each round under Evol-Instruct scenario.
  • Figure 3: Comparison of performance among Qwen-2.5 series models. Detailed results can be found in the paper's scaling table (label tab:scaling).
  • Figure 4: Distribution of Minimum Neighbor Distance for instructions generated by Llama-3.1-8B-Instruct and Llama-3.1-70B-Instruct in the AutoIF scenario.
  • Figure 5: Comparison of output token probability distributions in the Evol-Instruct scenario.
  • ...and 15 more figures