From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning

Xuansheng Wu; Wenlin Yao; Jianshu Chen; Xiaoman Pan; Xiaoyang Wang; Ninghao Liu; Dong Yu

TL;DR

This work investigates how instruction tuning alters the internal behavior of large language models by developing an interpretability toolbox that analyzes gradient-based input-output attributions, word-word relations in self-attention, and concepts derived from FFN weights. By comparing pre-trained and instruction-tuned models from the LLaMA and Mistral families (including Vicuna variants), the authors identify three core effects: instruction words become consistently influential in guiding generation, self-attention heads encode more relations involving instruction verbs, and FFNs rotate pre-trained knowledge toward user-oriented tasks without changing the distribution of concepts across linguistic levels. The study provides both qualitative visualizations and quantitative metrics (e.g., importance density) to explain how instruction tuning aligns models with user intent. The authors present the toolbox as groundwork for explaining and optimizing LLMs for practical applications.

Abstract

Large Language Models (LLMs) have achieved remarkable success, and instruction tuning is the critical step in aligning them with user intentions. In this work, we investigate how instruction tuning adjusts pre-trained models, with a focus on intrinsic changes. Specifically, we first develop several local and global explanation methods, including a gradient-based method for input-output attribution and techniques for interpreting patterns and concepts in self-attention and feed-forward layers. The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models. This approach provides an internal perspective on the model shifts at a human-comprehensible level. Our findings reveal three significant impacts of instruction tuning: 1) It empowers LLMs to recognize the instruction parts of user prompts and keeps response generation consistently conditioned on those instructions. 2) It encourages the self-attention heads to capture more word-word relationships involving instruction verbs. 3) It encourages the feed-forward networks to rotate their pre-trained knowledge toward user-oriented tasks. These insights contribute to a more comprehensive understanding of instruction tuning and lay the groundwork for future work that aims at explaining and optimizing LLMs for various applications. Our code and data are publicly available at https://github.com/JacksonWuxs/Interpret_Instruction_Tuning_LLMs.
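To make the idea of gradient-based input-output attribution concrete, the following is a minimal sketch of a generic gradient-times-input score on a toy model. It is not the authors' implementation, and the paper's exact formulation may differ. The toy model (mean pooling, a tanh layer, a linear readout) and all names (`toy_logit`, `grad_wrt_embeddings`, `gradient_x_input`) are assumptions introduced only for illustration.

```python
import numpy as np


def toy_logit(embeddings, w_hidden, w_out):
    """Hypothetical stand-in for an output-token logit.

    embeddings: (n_tokens, d) input token embeddings
    w_hidden:   (h, d) weights of a tanh layer
    w_out:      (h,) linear readout
    """
    pooled = embeddings.mean(axis=0)        # (d,)
    hidden = np.tanh(w_hidden @ pooled)     # (h,)
    return float(w_out @ hidden)


def grad_wrt_embeddings(embeddings, w_hidden, w_out):
    """Analytic gradient of toy_logit with respect to each token embedding."""
    n_tokens = embeddings.shape[0]
    pooled = embeddings.mean(axis=0)
    hidden = np.tanh(w_hidden @ pooled)
    # Chain rule through the readout and tanh: W^T (w_out * (1 - tanh^2)).
    d_pooled = w_hidden.T @ (w_out * (1.0 - hidden ** 2))   # (d,)
    # Mean pooling gives every token an equal 1/n share of the gradient.
    return np.tile(d_pooled / n_tokens, (n_tokens, 1))      # (n_tokens, d)


def gradient_x_input(embeddings, w_hidden, w_out):
    """Per-token importance: |gradient . embedding|, normalized to sum to 1."""
    grads = grad_wrt_embeddings(embeddings, w_hidden, w_out)
    scores = np.abs((grads * embeddings).sum(axis=1))       # (n_tokens,)
    total = scores.sum()
    return scores / total if total > 0 else scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    emb = rng.normal(size=(6, 8))    # 6 hypothetical prompt tokens
    w_h = rng.normal(size=(4, 8))
    w_o = rng.normal(size=4)
    print(gradient_x_input(emb, w_h, w_o))
```

In a real LLM the gradient would come from automatic differentiation of an output token's score with respect to the input embeddings, repeated for each generated token. Comparing such scores between a pre-trained model and its instruction-tuned counterpart is the kind of analysis the abstract describes.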

Paper Structure (45 sections, 5 equations, 14 figures, 14 tables)

Introduction
Related Work
Interpreting Language Models.
Interpreting Instruction-tuned Models.
Preliminary
Notations
General Experimental Settings
Language Models.
Instruction Datasets.
Impact of User Prompts for Human Alignment
Quantifying Prompt Influence on Generation Process
Method.
Settings.
Assessing Instruction Following Capability with Importance Density
Method.
...and 30 more sections

Figures (14)

Figure 1: Salient maps of the prompt-response pair from LLaMA (left) and Vicuna (right).
Figure 2: Differences of word-word patterns between Vicuna and LLaMA over neuron and head levels.
Figure 3: Distribution of concepts at linguistic levels over different model layers.
Figure 4: CaseID: Self-Instruct-1. Prompt: The sentence you are given might be too wordy, complicated, or unclear. Rewrite the sentence and make your writing clearer by keeping it concise. Whenever possible, break complex sentences into multiple sentences and eliminate unnecessary words. Input: If you have any questions about my rate or if you find it necessary to increase or decrease the scope for this project, please let me know. Output: Vicuna: Do you have any questions about my rate or do you need to adjust the project scope? Please let me know.
Figure 5: CaseID: LIMA-144. Prompt: In a world where automation is becoming increasingly prevalent, is it more important to prioritize job creation or technological progress? Vicuna: In a world where automation is becoming increasingly prevalent, it is more important to prioritize job creation. In a world where automation is becoming increasingly prevalent, it is more important to prioritize technological progress. In a world where automation is becoming increasingly prevalent, it is important to prioritize both job creation and technological progress. In a world where automation is becoming increasingly prevalent, it is not necessary to prioritize either job creation or technological progress.
...and 9 more figures
