Learning or Self-aligning? Rethinking Instruction Fine-tuning

Mengjie Ren, Boxi Cao, Hongyu Lin, Cao Liu, Xianpei Han, Ke Zeng, Guanglu Wan, Xunliang Cai, Le Sun

TL;DR

This paper challenges the view that instruction fine-tuning primarily injects new domain knowledge. It proposes a knowledge intervention framework to decouple world-knowledge injection from behavioral norm transfer by probing internal parameter knowledge with in-context learning and manipulating IFT data accordingly. Across four domains and multiple base models, the authors show that consistency between pre- and post-IFT internal knowledge drives performance more than added world knowledge; contextualizing knowledge within prompts further mitigates adverse effects. The findings support a self-alignment view of IFT and offer practical guidance for constructing IFT data and evaluating alignment, with implications for future work on self-alignment, data design, and larger-scale models.

Abstract

Instruction Fine-tuning (IFT) is a critical phase in building large language models (LLMs). Previous work mainly focuses on IFT's role in transferring behavioral norms and in learning additional world knowledge. However, the understanding of the underlying mechanisms of IFT remains significantly limited. In this paper, we design a knowledge intervention framework to decouple the potential underlying factors of IFT, thereby enabling individual analysis of each factor. Surprisingly, our experiments reveal that attempting to learn additional world knowledge through IFT often fails to yield positive impacts and can even lead to markedly negative effects. Further, we discover that maintaining internal knowledge consistency before and after IFT is a critical factor for achieving successful IFT. Our findings reveal the underlying mechanisms of IFT and provide robust support for some very recent and potential future works.

Paper Structure (27 sections, 2 equations, 4 figures, 13 tables)

Figures (4)

  • Figure 1: Two potential mechanisms for instruction fine-tuning. 1) Learning, which injects world knowledge in IFT data into LLMs; 2) Self-aligning, which aligns queries with knowledge already in LLMs with similar behavioral norms. Elements with the same color are related.
  • Figure 2: The performance of Mistral-7B fine-tuned with instruction datasets of varying consistency ratios. Each dataset is composed of a mixture of incompatible and self-aligning data, and the consistency ratio represents the proportion of self-aligning samples. Note that a consistency ratio of 0 signifies that all data samples are incompatible, whereas a ratio of 1 indicates exclusively self-aligning data. The results of other base models are presented in the Appendix due to page limitations.
  • Figure 3: The regression analysis between model performance after fine-tuning and the knowledge consistency between the base model and the fine-tuned model. We show the results of Mistral-7B in three evaluations. The grouped linear regressions demonstrate positive correlations between model performance after IFT and the model's internal knowledge consistency before and after IFT. Points on the same regression line indicate the results of the same base model fine-tuned with different IFT data of the same domain on the same test set (HOMO, ID, or OOD).
  • Figure 4: The performance of LLaMA-2-13B and LLaMA-2-7B fine-tuned with instruction datasets of varying consistency ratios. Each dataset is composed of a mixture of incompatible and self-aligning data, and the consistency ratio represents the proportion of self-aligning samples.
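
The consistency ratio used in Figures 2 and 4 can be illustrated with a minimal sketch. It assumes the self-aligning and incompatible samples have already been separated into two pools; the function name `mix_by_consistency_ratio`, its parameters, and the sample layout are hypothetical and are not taken from the paper.

```python
import random


def mix_by_consistency_ratio(self_aligning, incompatible, ratio, total, seed=0):
    """Return `total` samples, of which a fraction `ratio` is self-aligning.

    Hypothetical helper: the paper does not describe its sampling code.
    ratio = 0 gives only incompatible samples, ratio = 1 only self-aligning ones.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")

    # Number of samples drawn from each pool (rounded to the nearest integer).
    n_aligned = round(ratio * total)
    n_incompatible = total - n_aligned
    if n_aligned > len(self_aligning) or n_incompatible > len(incompatible):
        raise ValueError("not enough samples in one of the pools")

    # Sample without replacement, then shuffle so the two kinds are interleaved.
    rng = random.Random(seed)
    mixed = rng.sample(self_aligning, n_aligned) + rng.sample(incompatible, n_incompatible)
    rng.shuffle(mixed)
    return mixed
```

Calling this helper for several ratios between 0 and 1 with a fixed `total` would yield a family of equally sized datasets that differ only in their mixture, which is the kind of sweep the figure captions describe.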