Accelerate Scaling of LLM Finetuning via Quantifying the Coverage and Depth of Instruction Set

Chengwei Wu, Li Du, Hanyu Zhao, Yiming Ju, Jiapu Wang, Tianyu Chen, Haoyi Zhou

TL;DR

The paper identifies semantic coverage and information depth as the key drivers of scalable supervised fine-tuning, addressing why simply increasing data often yields diminishing returns. It formalizes an information landscape view and introduces proxy indicators, including RID and ID, to quantify coverage and depth, demonstrating that these factors explain a large portion of validation-loss variance. The authors propose Information Landscape Approximation (ILA), a model-agnostic data-refinement algorithm that preserves coverage while maximizing local information depth, achieving faster and more sustained gains than prior selection methods across diverse tasks and model sizes. Empirical results span general-domain and math reasoning, showing accelerated scaling and highlighting that selecting highly informative, well-distributed instruction data is crucial for efficient SFT.

Abstract

Scaling the amount of data used for supervised fine-tuning (SFT) does not guarantee proportional gains in model performance, highlighting a critical need to understand what makes training samples effective. This work identifies two fundamental dataset properties that govern SFT scalability: semantic coverage, or the breadth of task domains, and information depth, or the richness of individual examples. We demonstrate that simple proxies for these properties explain the majority of validation loss variance in our experiments. We further propose Information Landscape Approximation (ILA), a model-agnostic data selection framework that jointly optimizes for these two factors. ILA constructs compact subsets that approximate the informational value of large datasets. Empirical results show that models tuned on ILA-selected data achieve faster and more sustained performance improvements across diverse tasks and model sizes compared to existing methods, a phenomenon we term accelerated scaling.
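To make the idea of jointly preserving coverage and maximizing depth concrete, here is a minimal sketch of a selection loop. It is not the paper's ILA algorithm: the function name `select_subset`, the precomputed `cell_id` (standing in for a semantic region, such as a cluster index), the precomputed `depth_score`, and the round-robin budget rule are all assumptions made for illustration.

```python
from collections import defaultdict


def select_subset(samples, budget):
    """Pick up to `budget` samples, spreading picks across semantic cells.

    samples: list of (cell_id, depth_score, text) tuples.
    Assumption: cell_id and depth_score are computed elsewhere; this sketch
    does not define how coverage or depth are measured.
    """
    cells = defaultdict(list)
    for sample in samples:
        cells[sample[0]].append(sample)

    # Within each cell, put the highest-depth samples first.
    for members in cells.values():
        members.sort(key=lambda s: s[1], reverse=True)

    chosen = []
    rank = 0
    # Round-robin: every cell contributes its rank-th deepest sample
    # before any cell contributes its next one.
    while len(chosen) < budget:
        progressed = False
        for cell_id in sorted(cells):
            members = cells[cell_id]
            if rank < len(members) and len(chosen) < budget:
                chosen.append(members[rank])
                progressed = True
        if not progressed:
            break  # every cell is exhausted
        rank += 1
    return chosen


# Toy pool: cell 0 is crowded, cells 1 and 2 are sparse.
pool = [
    (0, 0.9, "a"),
    (0, 0.8, "b"),
    (0, 0.1, "c"),
    (1, 0.3, "d"),
    (2, 0.5, "e"),
]
picked = select_subset(pool, budget=4)
```

With this toy pool, all three cells are represented before the crowded cell receives a second pick, and the low-depth sample in that cell is the last to be considered. A method that approximates the full information landscape would presumably also account for how densely each region is populated, which this sketch ignores.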

Paper Structure

This paper contains 35 sections, 6 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: (a) Illustration of the information depth, coverage, and domain distribution of an instruction set; (b) The dev loss of a fine-tuned model can be well fitted using the information depth and coverage of the instruction set used for fine-tuning; (c) Performance of ILA scales up faster than simply enlarging the instruction set and faster than SoTA instruction selection methods, suggesting an "accelerated scaling" behavior.
  • Figure 2: (a) The calculation of the proxy indicators measuring the information depth and coverage of an instruction set, which together form a landscape characterizing the distribution of the instruction set. (b) Illustration of the Information Landscape Approximation (ILA) instruction refinement algorithm, which makes the information landscape of the selected subset approximate that of the original instruction pool.
  • Figure 3: (a) Regression results of the dev loss vs. coverage and depth of instruction sets; (b) Scatter plot of predicted vs. actual dev loss.
  • Figure 4: The x-axis represents the number of tokens and the y-axis shows the evaluation metric scores; the dashed lines connect results obtained using an equal number of instructions.
  • Figure 5: Information depth and coverage of subsets selected by ILA, Deita, and Random Selection.
  • ...and 1 more figure
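
The regression in Figure 3, where dev loss is fitted from coverage and depth, can be illustrated with an ordinary least-squares sketch. The linear form, the helper names `fit_linear` and `r_squared`, and the data below are assumptions for illustration only; the source does not state the functional form the authors use, and the numbers are synthetic, not results from the paper.

```python
def fit_linear(rows):
    """Least-squares fit of loss = b0 + b1 * coverage + b2 * depth.

    rows: list of (coverage, depth, loss) tuples.
    Returns (b0, b1, b2). Assumes the design matrix has full rank.
    """
    feats = [(1.0, c, d) for c, d, _ in rows]
    ys = [y for _, _, y in rows]
    # Normal equations: (A^T A) w = A^T y.
    ata = [[sum(f[i] * f[j] for f in feats) for j in range(3)] for i in range(3)]
    aty = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(3)]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    m = [ata[i] + [aty[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))


def r_squared(rows, weights):
    """Fraction of loss variance explained by the fitted model."""
    b0, b1, b2 = weights
    ys = [y for _, _, y in rows]
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (b0 + b1 * c + b2 * d)) ** 2 for c, d, y in rows)
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot


# Synthetic data generated exactly from loss = 2.0 - 0.5*coverage - 0.3*depth.
data = [(c, d, 2.0 - 0.5 * c - 0.3 * d)
        for c, d in [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]]
weights = fit_linear(data)
score = r_squared(data, weights)
```

Because the synthetic losses are generated without noise, the fit recovers the generating coefficients and the explained-variance score is essentially 1. On real measurements the score would indicate how much of the dev-loss variance the two proxies account for.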