How Instruction and Reasoning Data shape Post-Training: Data Quality through the Lens of Layer-wise Gradients

Ming Li, Yanhong Li, Ziyue Li, Tianyi Zhou

TL;DR

This work introduces a gradient-based spectral framework to study how data quality during LLM post-training influences layer-wise gradient dynamics for instruction and reasoning data. By defining SVD-based metrics (nuclear norm and effective rank) and similarity-based metrics across projection layers, the authors unify various data-quality signals and demonstrate that high-quality data produce smaller gradient magnitudes but richer, higher-rank gradient directions. Remarkably, reasoning data yield even higher effective ranks than instruction data, suggesting more diverse gradient updates for complex tasks. The findings hold across model families and scales, though gradient patterns differ by architecture; fast vs slow thinking analyses further reveal that explicit reasoning prompts align with more stable, multi-direction updates. The work offers a spectral lens to guide data selection and post-training strategies beyond traditional end-task metrics.

Abstract

As the post-training of large language models (LLMs) advances from instruction-following to complex reasoning tasks, understanding how different data affect finetuning dynamics remains largely unexplored. In this paper, we present a spectral analysis of layer-wise gradients induced by low/high-quality instruction and reasoning data for LLM post-training. Our analysis reveals that widely-studied metrics for data evaluation, e.g., IFD, InsTag, Difficulty, and Reward, can be explained and unified by spectral properties computed from gradients' singular value decomposition (SVD). Specifically, higher-quality data are usually associated with lower nuclear norms and higher effective ranks. Notably, effective rank exhibits better robustness and resolution than nuclear norm in capturing subtle quality differences. For example, reasoning data achieves substantially higher effective ranks than instruction data, implying richer gradient structures on more complex tasks. Our experiments also highlight that models within the same family share similar gradient patterns regardless of their sizes, whereas different model families diverge significantly. Providing a unified view on the effects of data quality across instruction and reasoning data, this work illuminates the interplay between data quality and training stability, shedding novel insights into developing better data exploration strategies for post-training.
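The two spectral quantities named in the abstract can be sketched for a single layer's gradient matrix as follows. This is a minimal illustration, not the authors' implementation: it assumes the common entropy-based definition of effective rank (the exponential of the Shannon entropy of the normalized singular values), which may differ from the paper's exact formula, and the function names `nuclear_norm` and `effective_rank` are hypothetical.

```python
import numpy as np


def nuclear_norm(grad: np.ndarray) -> float:
    """Sum of the singular values of a 2-D gradient matrix."""
    singular_values = np.linalg.svd(grad, compute_uv=False)
    return float(singular_values.sum())


def effective_rank(grad: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank (assumed definition, see lead-in).

    The singular values are normalized into a probability distribution,
    and the result is exp(entropy) of that distribution.
    """
    singular_values = np.linalg.svd(grad, compute_uv=False)
    total = singular_values.sum()
    if total <= eps:
        # An all-zero gradient has no meaningful direction.
        return 0.0
    p = singular_values / total
    p = p[p > eps]  # drop numerically zero entries to avoid log(0)
    entropy = -(p * np.log(p)).sum()
    return float(np.exp(entropy))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for one projection layer's gradient; a real analysis
    # would use the gradient of the loss w.r.t. that layer's weights.
    grad = rng.normal(size=(64, 32))
    print("nuclear norm:", nuclear_norm(grad))
    print("effective rank:", effective_rank(grad))
```

Under this definition, a rank-one gradient has an effective rank of about 1, while a gradient whose singular values are all equal has an effective rank equal to the number of singular values. This matches the abstract's reading of a higher effective rank as a richer, more multi-directional update.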

Paper Structure

This paper contains 39 sections, 6 equations, 96 figures, 60 tables.

Figures (96)

  • Figure 1: Low/high-quality data (measured by Reward) and their gradient properties (nuclear norms and effective ranks) across layers on diverse datasets including WizardLM, OpenHermes 2.5, and Magpie. The y-axis scale is shared across the nuclear-norm plots but differs across the effective-rank plots because of the large discrepancy in values. For each specific model, the shapes of the gradient curves derived from different data sources are almost the same. The nuclear norm fails to reflect the quality discrepancies between datasets, while the effective rank still captures them, e.g., Magpie has a higher effective rank than the others.
  • Figure 2: Model size scaling law for gradient properties. Within the same model family, the layer-wise gradient statistics and dynamics are relatively consistent. Gradients on larger models distinguish data quality better, as shown by the increasing y-axis scales from the $1.5$B model to the $14$B model.
  • Figure 3: Gradient properties across different model families. The same data induce substantially different gradient dynamics on different model families. This might be caused by their distinct model structures or training recipes and may reflect their different capabilities.
  • Figure 4: The prompt for InsTag.
  • Figure 5: The prompt for Difficulty.
  • ...and 91 more figures