Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set

Chengwei Wu; Li Du; Hanyu Zhao; Yiming Ju; Jiapu Wang; Tianyu Chen; Haoyi Zhou

TL;DR

The paper identifies semantic coverage and information depth as the key drivers of scalable supervised fine-tuning, addressing why simply increasing data often yields diminishing returns. It formalizes an information landscape view and introduces proxy indicators, including RID and ID, to quantify coverage and depth, demonstrating that these factors explain a large portion of validation-loss variance. The authors propose Information Landscape Approximation (ILA), a model-agnostic data-refinement algorithm that preserves coverage while maximizing local information depth, achieving faster and more sustained gains than prior selection methods across diverse tasks and model sizes. Empirical results span general-domain and math reasoning, showing accelerated scaling and highlighting that selecting highly informative, well-distributed instruction data is crucial for efficient SFT.

Abstract

Scaling the amount of data used for supervised fine-tuning (SFT) does not guarantee proportional gains in model performance, highlighting a critical need to understand what makes training samples effective. This work identifies two fundamental dataset properties that govern SFT scalability: \textbf{semantic coverage}, or the breadth of task domains, and \textbf{information depth}, or the richness of individual examples. We demonstrate that simple proxies for these properties explain the majority of validation loss variance in our experiments. We further propose \textbf{Information Landscape Approximation (ILA)}, a model-agnostic data selection framework that jointly optimizes for these two factors. ILA constructs compact subsets that approximate the informational value of large datasets. Empirical results show that models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes than existing methods, a phenomenon we term \textbf{accelerated scaling}.
