The Right Time Matters: Data Arrangement Affects Zero-Shot Generalization in Instruction Tuning

Bingxiang He; Ning Ding; Cheng Qian; Jia Deng; Ganqu Cui; Lifan Yuan; Haiwen Hong; Huan-ang Gao; Longtao Huang; Hui Xue; Huimin Chen; Zhiyuan Liu; Maosong Sun

TL;DR

This work reframes zero-shot generalization in instruction tuning from a task-centered view to a data-centric one, showing that generalization emerges very early in training and is tracked more stably by loss than by traditional metrics. It analyzes how the arrangement of training data, characterized by its similarity to the test data and by its granularity, produces rapid or delayed generalization, and shows that exposing the model early to high-similarity, fine-grained data improves performance on unseen tasks. The authors introduce Test-centric Multi-turn Arrangement (TMA), a framework that organizes training data around the test data to promote continual learning and further loss reduction, and report gains across multiple datasets. Overall, the study offers a data-centric approach to improving zero-shot generalization in instruction-tuned LLMs and highlights practical considerations for data curation and training strategies.

Abstract

Understanding alignment techniques begins with comprehending the zero-shot generalization brought by instruction tuning, but its mechanism remains poorly understood. Existing work has largely been confined to the task level, without considering that tasks are artificially defined and, to LLMs, merely consist of tokens and representations. To bridge this gap, we investigate zero-shot generalization from the perspective of the data itself. We first demonstrate that zero-shot generalization happens very early during instruction tuning, with loss serving as a stable indicator. Next, we investigate training data arrangement through similarity and granularity perspectives, confirming that the timing of exposure to certain training examples may greatly facilitate generalization on unseen tasks. Finally, we propose a more grounded training data arrangement framework, Test-centric Multi-turn Arrangement, and show its effectiveness in promoting continual learning and further loss reduction. For the first time, we show that zero-shot generalization during instruction tuning is a form of similarity-based generalization between training and test data at the instance level. Our code is released at https://github.com/thunlp/Dynamics-of-Zero-Shot-Generalization.
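The idea of arranging training data around the test data can be illustrated with a small sketch. This is not the authors' implementation (see the released repository for that); it assumes that every example is already represented by an embedding vector, that closeness is measured by cosine distance, and that in each round every test example claims its nearest remaining training example. The names `cosine_distance` and `arrange_rounds` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity; a zero vector is treated as distance 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def arrange_rounds(train: List[Vector], test: List[Vector]) -> List[List[int]]:
    """Group training indices into rounds, nearest-to-test first.

    Assumption (not from the source): in each round, every test embedding
    picks its closest training embedding that has not been picked yet.
    Rounds repeat until the training pool is empty.
    """
    remaining = set(range(len(train)))
    rounds: List[List[int]] = []
    while remaining and test:
        picked: List[int] = []
        for t in test:
            if not remaining:
                break
            # Break distance ties by index so the result is deterministic.
            best = min(remaining, key=lambda i: (cosine_distance(train[i], t), i))
            remaining.remove(best)
            picked.append(best)
        rounds.append(picked)
    return rounds


if __name__ == "__main__":
    train_emb = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
    test_emb = [[1.0, 0.0], [0.0, 1.0]]
    for k, indices in enumerate(arrange_rounds(train_emb, test_emb), start=1):
        print(f"round {k}: training indices {indices}")
```

Concatenating the rounds in order gives a training sequence in which the examples closest to the test data are seen first, which is the kind of early exposure the abstract associates with faster loss reduction. Task labels play no role in the ordering.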

Paper Structure (49 sections, 1 theorem, 16 equations, 16 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Positioning Zero-Shot Generalization
Early Zero-Shot Generalization
Loss as Generalization Indicator
Data Arrangement Effects on Zero-Shot Generalization
Pilot Study
Through Data Similarity and Granularity
Effect of High-Similarity Data
Effect of Fine-Grained Data
Test-centric Multi-turn Arrangement
TMA Improves Zero-Shot Generalization
Ablation Study
Conclusion
Details for \ref{section2}
...and 34 more sections

Key Result

Theorem B.1

Optimal Substructure of Cosine-Avg and Cosine-Min: Let $f$ be a function for calculating a dataset-level similarity distance (Cosine-Avg or Cosine-Min) that takes two sets $A$ and $B$ as input and outputs a real number. Given a training set $\mathcal{D}_{\text{train}}$ and a test set $\mathcal{D}_{\text{test}}$ ...
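The theorem statement is cut off before Cosine-Avg and Cosine-Min are defined, so the following is only one plausible reading, not the paper's definition: it assumes both measures aggregate pairwise cosine distances between the embeddings in $A$ and $B$, using the mean for Cosine-Avg and the minimum for Cosine-Min. The function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(u: Vector, v: Vector) -> float:
    """Return 1 - cosine similarity; a zero vector is treated as distance 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def _pairwise(a: List[Vector], b: List[Vector]) -> List[float]:
    """All pairwise cosine distances between two non-empty sets."""
    if not a or not b:
        raise ValueError("both sets must be non-empty")
    return [cosine_distance(x, y) for x in a for y in b]


def cosine_avg(a: List[Vector], b: List[Vector]) -> float:
    """Assumed Cosine-Avg: mean pairwise cosine distance between A and B."""
    dists = _pairwise(a, b)
    return sum(dists) / len(dists)


def cosine_min(a: List[Vector], b: List[Vector]) -> float:
    """Assumed Cosine-Min: smallest pairwise cosine distance between A and B."""
    return min(_pairwise(a, b))


if __name__ == "__main__":
    set_a = [[1.0, 0.0], [0.0, 1.0]]
    set_b = [[1.0, 0.0]]
    print("Cosine-Avg:", cosine_avg(set_a, set_b))
    print("Cosine-Min:", cosine_min(set_a, set_b))
```

Under this reading, both functions map two sets to a single real number, matching the signature of $f$ in the theorem.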

Figures (16)

Figure 1: How data arrangement affects zero-shot generalization. Different shapes represent distinct task types, while similar colors indicate semantic similarity between data points. Top and Bottom show traditional random data ordering and task-based continued fine-tuning, respectively, both of which yield gradual loss reduction. Our approach (Middle) prioritizes training on data points that are similar (in color) to the test set and ignores task boundaries (shape), enabling more rapid loss reduction.
Figure 2: Average ROUGE-1, ROUGE-L, and Exact-Match scores (left), average RM scores (middle), and average loss scores (right) of checkpoints fine-tuned on NIV2 (left, middle, right), P3 (middle, right), and Flan-mini (middle, right), all evaluated on unseen tasks.
Figure 3: Sudden decrease in the average loss under cluster scheduling for the three tasks at steps 400, 450, and 150 respectively.
Figure 4: An overview of Round Robin, Random and Cluster data arrangements. Definitions of colors and shapes are consistent with those in Figure 1.
Figure 5: An overview of NFT and FFT data arrangements. Definitions of colors and shapes are consistent with those in Figure 1.
...and 11 more figures

Theorems & Definitions (1)

Theorem B.1
