Infinity Instruct: Scaling Instruction Selection and Synthesis to Enhance Language Models
Jijie Li, Li Du, Hanyu Zhao, Bo-wen Zhang, Liangdong Wang, Boyan Gao, Guang Liu, Yonghua Lin
TL;DR
Infinity-Instruct presents a scalable, data-centric pipeline for constructing high-quality open-source instruction datasets by combining data selection and synthesis. The two-phase design yields InfInstruct-F-7.4M (foundational) and InfInstruct-G-1.5M (conversational), with rigorous deduplication and contamination filtering. Fine-tuning multiple open-source LLMs on these datasets yields substantial gains on both foundational and instruction-following benchmarks, consistently surpassing the official instruction-tuned counterparts; InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. The work highlights a strong synergy between foundational and conversational training and demonstrates the potential of open datasets to narrow the gap with proprietary models. The public release of the datasets and code further enables broad adoption and continued progress in LLM alignment and generalization.
Abstract
Large Language Models (LLMs) demonstrate strong performance in real-world applications, yet existing open-source instruction datasets often concentrate on narrow domains, such as mathematics or coding, which limits generalization and widens the gap with proprietary models. To bridge this gap, we introduce Infinity-Instruct, a high-quality instruction dataset designed to enhance both the foundational and chat capabilities of LLMs through a two-phase pipeline. In Phase 1, we curate 7.4M high-quality foundational instructions (InfInstruct-F-7.4M) from over 100M samples using hybrid data selection techniques. In Phase 2, we synthesize 1.5M high-quality chat instructions (InfInstruct-G-1.5M) through a two-stage process involving instruction selection, evolution, and diagnostic filtering. We empirically evaluate Infinity-Instruct by fine-tuning several open-source models, including Mistral, LLaMA, Qwen, and Yi, and observe substantial performance gains on both foundational and instruction-following benchmarks, consistently surpassing the official instruction-tuned counterparts. Notably, InfInstruct-LLaMA3.1-70B outperforms GPT-4-0314 by 8.6% on instruction-following tasks while achieving comparable foundational performance. These results underscore the synergy between foundational and chat training and offer new insights into holistic LLM development. Our dataset (https://huggingface.co/datasets/BAAI/Infinity-Instruct) and code (https://gitee.com/li-touch/infinity-instruct) have been publicly released.
