Scaling Towards the Information Boundary of Instruction Sets: The Infinity Instruct Subject Technical Report

Li Du, Hanyu Zhao, Yiming Ju, Tengfei Pan

TL;DR

Problem: instruction datasets often lack sufficient coverage of task types and depth of instructions, limiting instruction-following performance. Approach: a closed-loop data-construction framework that combines hierarchical tagging, informative seed selection, evolutionary data synthesis, and deficiency-driven augmentation to systematically expand coverage and depth, resulting in the InfInstruct-Sub dataset with ~1.5 million instructions. Contributions: empirical evidence that InfInstruct-Sub improves instruction-following across foundation models and benchmarks, along with analyses showing enhanced coverage and depth and a scaling law in tag connectivity. Impact: provides a practical path to continuously evolve instruction data and offers insights into the structure of instruction content that can guide more efficient training and better generalization.

Abstract

Instruction tuning has become a foundation for unlocking the capabilities of large-scale pretrained models and improving their performance on complex tasks. Thus, the construction of high-quality instruction datasets is crucial for enhancing model performance and generalizability. Although current instruction datasets have reached tens of millions of samples, models finetuned on them may still struggle with complex instruction following and with tasks in rare domains. This is primarily due to limited expansion in both the "coverage" (the range of task types and knowledge areas) and the "depth" (instruction complexity) of the instruction set. To address this issue, we propose a systematic instruction data construction framework that integrates a hierarchical tagging system, an informative seed selection algorithm, an evolutionary data synthesis process, and model deficiency diagnosis paired with targeted data generation. These components form an iterative closed loop that continuously enhances the coverage and depth of instruction data. Based on this framework, we construct Infinity Instruct Subject, a high-quality dataset containing ~1.5 million instructions. Experiments on multiple foundation models and benchmark tasks demonstrate its effectiveness in improving instruction-following capabilities. Further analyses suggest that Infinity Instruct Subject shows enlarged coverage and depth compared to comparable synthesized instruction datasets. Our work lays a theoretical and practical foundation for the efficient, continuous evolution of instruction datasets, moving from data quantity expansion to qualitative improvement.
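
To make the closed loop described in the abstract concrete, the following is a minimal sketch of how the four components (tagging, seed selection, evolutionary synthesis, and deficiency-driven generation) might be wired together. It is an illustration under stated assumptions, not the authors' implementation: every function name, the rare-tag seed heuristic, and the toy stand-ins for the tagger, the evolver, and the diagnosis step are hypothetical. In the paper's setting those stand-ins would presumably be model-based.

```python
from collections import Counter
from typing import Callable, List, Set


def build_instruction_set(
    pool: List[str],
    tag: Callable[[str], Set[str]],
    evolve: Callable[[str], str],
    diagnose: Callable[[List[str]], Set[str]],
    generate_for: Callable[[str], str],
    rounds: int = 2,
    seeds_per_round: int = 2,
) -> List[str]:
    """Hypothetical closed loop: tag -> select seeds -> evolve -> diagnose -> augment."""
    data = list(pool)
    for _ in range(rounds):
        # 1. Tagging: count how often each tag occurs in the current data.
        tag_counts = Counter(t for x in data for t in tag(x))

        # 2. Seed selection (toy heuristic, an assumption): prefer
        #    instructions that carry the rarest tags.
        def rarity(x: str) -> float:
            return min((tag_counts[t] for t in tag(x)), default=float("inf"))

        seeds = sorted(data, key=rarity)[:seeds_per_round]

        # 3. Evolutionary synthesis: derive a harder variant of each seed.
        data.extend(evolve(s) for s in seeds)

        # 4. Deficiency diagnosis and targeted generation for weak tags.
        data.extend(generate_for(weak) for weak in sorted(diagnose(data)))
    return data


# Toy stand-ins so the sketch runs end to end (all hypothetical).
KNOWN_TAGS = {"math", "code", "law"}


def toy_tag(x: str) -> Set[str]:
    return {w for w in x.lower().split() if w in KNOWN_TAGS}


def toy_evolve(x: str) -> str:
    return x + " Explain each step."


def toy_diagnose(data: List[str]) -> Set[str]:
    # Treat any known tag that no instruction covers as a "deficiency".
    covered = set().union(*(toy_tag(x) for x in data))
    return KNOWN_TAGS - covered


def toy_generate(tag_name: str) -> str:
    return f"Write a {tag_name} question."


pool = [
    "Solve this math problem.",
    "Write code to sort a list.",
    "More math practice.",
]
result = build_instruction_set(pool, toy_tag, toy_evolve, toy_diagnose, toy_generate)
```

In this toy run the uncovered tag is filled in during the first round, and later rounds then favor it during seed selection because it is rare. This is meant only to mirror the abstract's idea that diagnosis feeds back into the next round of construction.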

Paper Structure

This paper contains 23 sections, 5 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Construction pipeline of the InfInstruct-Sub dataset.
  • Figure 1: Classification standard for categorizing a fine-grained tag into a domain-tag.
  • Figure 2: Tagging system of the InfInstruct-Sub dataset for elucidating content distribution of instruction pools. (a) Fine-grained tags and normalization of fine-grained tags. (b) Construction of categorical tags and the process of mapping fine-grained tags to categorical tags.
  • Figure 3: Prompt used for guiding the LLM to generate tags for a given instruction.
  • Figure 4: Prompt for categorizing a fine-grained tag into a domain-tag.
  • ...and 11 more figures