Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models

Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, Yonghua Lin

TL;DR

Infinity-Instruct presents a scalable, data-centric pipeline for constructing high-quality open-source instruction datasets by combining data selection and synthesis. The two-phase design yields InfInstruct-F-7.4M (foundational) and InfInstruct-G-1.5M (conversational), with deduplication and contamination filtering applied. Fine-tuning several open-source LLMs (Mistral, LLaMA, Qwen, and Yi) on these datasets yields substantial gains on both foundational and instruction-following benchmarks, consistently surpassing the official instruction-tuned counterparts. InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. The work highlights a synergy between foundational and conversational training and shows that open datasets can narrow the gap with proprietary models. The datasets and code are publicly released.

Abstract

Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, limiting generalization and widening the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains across both foundational and instruction-following benchmarks, consistently surpassing official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct) and code (https://gitee.com/li-touch/infinity-instruct) have been publicly released.
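
This summary does not spell out how deduplication and benchmark contamination filtering are carried out. As a rough illustration only, the sketch below shows one common generic approach: exact deduplication after text normalization, followed by dropping any instruction that shares a word n-gram with a benchmark prompt. The function names, the normalization rule, and the n-gram size of 5 are all assumptions made for this example, not the authors' procedure.

```python
import re


def normalize(text):
    """Lowercase and collapse non-alphanumeric runs so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def ngrams(text, n):
    """Return the set of word n-grams in the normalized text."""
    words = normalize(text).split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def filter_instructions(candidates, benchmark_prompts, n=5):
    """Hypothetical helper: drop duplicates, then drop benchmark overlaps."""
    benchmark_grams = set()
    for prompt in benchmark_prompts:
        benchmark_grams |= ngrams(prompt, n)

    seen = set()
    kept = []
    for text in candidates:
        key = normalize(text)
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        if ngrams(text, n) & benchmark_grams:
            continue  # shares an n-gram with a benchmark prompt
        kept.append(text)
    return kept
```

Texts shorter than n words produce no n-grams and therefore always pass the overlap check. A production pipeline would more likely use fuzzy matching such as MinHash for near-duplicates, which this sketch does not attempt.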

Paper Structure

This paper contains 19 sections, 5 figures, 9 tables.

Figures (5)

  • Figure 1: The overall structure for building the Infinity-Instruct dataset.
  • Figure 2: Overview of the data selection pipeline.
  • Figure 3: Overview of the data synthesis pipeline.
  • Figure 4: Scaling curves on foundational and conversational tasks.
  • Figure 5: t-SNE visualization and analysis of the first-level label type distribution of instructions in the InfInstruct-F-7.4M and InfInstruct-G-1.5M datasets. We sampled up to 2000 instructions per label type.
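
The per-label cap mentioned in the Figure 5 caption (up to 2000 instructions per label type) can be sketched as a capped random sample within each label group. This is a minimal illustration, assuming records arrive as (label, text) pairs; the function name, record layout, and seed are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict


def sample_per_label(records, cap=2000, seed=0):
    """Keep at most `cap` randomly chosen texts for each label."""
    by_label = defaultdict(list)
    for label, text in records:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = {}
    for label, texts in by_label.items():
        if len(texts) > cap:
            texts = rng.sample(texts, cap)  # sample without replacement
        sampled[label] = texts
    return sampled
```

Labels with fewer than `cap` instructions are kept whole, so rare label types are not dropped from the visualization.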