From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning
Xuansheng Wu, Wenlin Yao, Jianshu Chen, Xiaoman Pan, Xiaoyang Wang, Ninghao Liu, Dong Yu
TL;DR
This work investigates how instruction tuning alters the internal behavior of large language models by developing an interpretability toolbox that analyzes gradient-based input-output attributions, word-word relations in self-attention, and concepts derived from FFN weights. By comparing pre-trained and instruction-tuned models from the LLaMA/Mistral families (including Vicuna variants), the authors identify three core effects: instruction words become consistently influential in guiding generation, self-attention heads encode more instruction-verb relations, and FFNs rotate pre-trained knowledge toward user-oriented tasks without changing linguistic-level distributions. The study provides both qualitative visualizations and quantitative metrics (e.g., importance density) to explain how instruction tuning aligns models with user intent, and it lays groundwork for optimizing LLMs for practical applications. The toolbox offers a concrete path toward more interpretable and controllable instruction following in future AI systems.
Abstract
Large Language Models (LLMs) have achieved remarkable success, and instruction tuning is the critical step in aligning them with user intentions. In this work, we investigate how instruction tuning adjusts pre-trained models, with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective on the model shifts at a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts and promotes response generation that is consistently conditioned on the instructions. 2) It encourages the self-attention heads to capture more word-word relationships about instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs.
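To make the idea of gradient-based input-output attribution more concrete, the following is a minimal sketch of one common formulation, gradient times input, applied to a toy model. It is not the authors' implementation, and the exact attribution formula used in the paper is not given in this section. The embeddings `EMBED`, the output weights `W_OUT`, the toy `logit` model, and the helper `grad_times_input` are all hypothetical and exist only for illustration; gradients are approximated by central finite differences so that the example needs no external libraries.

```python
import math

# Hypothetical 2-d token embeddings and output weights (illustrative values only).
EMBED = {
    "translate": [1.0, 0.5],
    "the": [0.1, 0.0],
    "sentence": [0.3, -0.2],
}
W_OUT = {"bonjour": [1.2, 0.8], "hello": [-0.5, 0.4]}


def logit(vectors, target):
    """Toy model: sum the input vectors, apply tanh, project onto the target row."""
    dims = len(W_OUT[target])
    pooled = [math.tanh(sum(v[d] for v in vectors)) for d in range(dims)]
    return sum(w * p for w, p in zip(W_OUT[target], pooled))


def grad_times_input(tokens, target, eps=1e-5):
    """Return normalized |gradient . embedding| scores, one per input token."""
    vectors = [list(EMBED[t]) for t in tokens]
    scores = []
    for vec in vectors:
        dot = 0.0
        for d in range(len(vec)):
            original = vec[d]
            # Central finite difference of the target logit w.r.t. this coordinate.
            vec[d] = original + eps
            up = logit(vectors, target)
            vec[d] = original - eps
            down = logit(vectors, target)
            vec[d] = original  # restore before moving on
            dot += (up - down) / (2 * eps) * original
        scores.append(abs(dot))
    total = sum(scores) or 1.0  # avoid division by zero if all scores vanish
    return [s / total for s in scores]


if __name__ == "__main__":
    prompt = ["translate", "the", "sentence"]
    for token, score in zip(prompt, grad_times_input(prompt, "bonjour")):
        print(f"{token:>10s}  {score:.3f}")
```

In this toy setup the instruction word "translate" receives the largest share of the attribution for the output token "bonjour". Under the assumption that the same scoring function can be run on both a pre-trained model and its instruction-tuned counterpart, comparing the two score distributions mirrors the kind of comparison the abstract describes.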
