How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence

Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang

TL;DR

This work provides a mechanistic, cross-model examination of post-training in LLMs along four axes: knowledge storage and representation, internal truthfulness beliefs, refusal behavior, and confidence. Using causal tracing, linear truthfulness and refusal directions, and cross-model patching on Llama-3.1-8B, Mistral-7B, and Llama-2-13B, it shows that knowledge storage locations and truthfulness directions are largely preserved after post-training, whereas refusal directions are reshaped and exhibit limited forward transfer. Confidence differences are not explained by entropy neurons. The findings suggest practical paths for steering and for transferring knowledge edits from base to post-trained models, as well as the potential to transfer capabilities developed in post-training back to the base model, informing future interpretability and post-training strategies. Code for reproducing the analyses is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.

Abstract

Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While many works have studied post-training algorithms and evaluated post-trained models by their outputs, how post-training reshapes LLMs internally remains understudied. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained models, and it is effectively transferable for interventions; (3) The refusal direction differs between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. Our code is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.
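
To make finding (2) concrete, the following is a minimal sketch of how a linear direction might be extracted and compared across two models. It assumes a difference-in-means estimator over hidden states (this summary does not state which estimator the paper uses), and the toy two-dimensional "hidden states" and all function names are hypothetical illustrations, not the authors' code.

```python
import math
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(rows)
    dim = len(rows[0])
    return [sum(row[i] for row in rows) / n for i in range(dim)]


def difference_in_means(pos: Sequence[Sequence[float]],
                        neg: Sequence[Sequence[float]]) -> Vector:
    """Assumed estimator: mean(positive states) minus mean(negative states)."""
    mu_pos = mean_vector(pos)
    mu_neg = mean_vector(neg)
    return [p - q for p, q in zip(mu_pos, mu_neg)]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def steer(hidden: Sequence[float], direction: Sequence[float],
          alpha: float) -> Vector:
    """Shift a hidden state along a direction with strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy hidden states for true/false statements (made-up numbers).
base_true = [[1.0, 0.0], [1.0, 0.2]]
base_false = [[-1.0, 0.0], [-1.0, -0.2]]
post_true = [[1.0, 0.3], [1.0, 0.1]]
post_false = [[-1.0, -0.3], [-1.0, -0.1]]

base_direction = difference_in_means(base_true, base_false)
post_direction = difference_in_means(post_true, post_false)
similarity = cosine_similarity(base_direction, post_direction)

# "Transferred" intervention: apply the base-model direction to a
# post-model hidden state that currently sits on the "false" side.
steered = steer(post_false[0], base_direction, alpha=1.0)
print(f"cosine similarity: {similarity:.4f}")
print(f"steered state: {steered}")
```

A cosine similarity near 1 between the two directions is the kind of evidence the abstract describes for truthfulness, and a low value is what it describes for refusal. In practice the vectors would be residual-stream activations at a chosen layer and token position rather than hand-written lists.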

Paper Structure

This paper contains 42 sections, 4 equations, 25 figures, 22 tables.

Figures (25)

  • Figure 1: Summary of our analysis and findings. (a) Knowledge: A difference heatmap showing base and post models have negligible location differences for storing the same knowledge; (b) Truthfulness: A PCA plot showing the truthfulness directions are similar in base and post models; (c) Refusal: A PCA plot showing the refusal directions of base and post models are quite different; (d) Confidence: A Venn diagram of entropy neuron IDs showing the difference in confidence between base and post models cannot be fully attributed to entropy neurons as they largely overlap.
  • Figure 2: Knowledge storage locations of Llama-3.1-8B base and instruct on the cities dataset. Their knowledge-storage locations are almost the same.
  • Figure 3: Cosine similarities of truthfulness (a and b) and refusal (c) directions of Llama-3.1-8B base, instruct, and sft. Truthfulness directions are similar while refusal directions are different.
  • Figure 4: Example output of Llama-3.1-8B-Instruct with intervention. The gray box shows the next-token output with the highest predicted probability, with that probability in parentheses. Transferred intervention can flip the output as successfully as native intervention.
  • Figure 5: Refusal keywords used to detect refusal behavior.
  • ...and 20 more figures
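
The keyword-based refusal detection mentioned in the Figure 5 caption can be sketched as a simple substring match. The keyword list below is a hypothetical placeholder (the actual keywords from Figure 5 are not reproduced in this summary), and the function names are illustrative assumptions rather than the authors' implementation.

```python
from typing import Iterable, Sequence

# Hypothetical placeholder phrases; NOT the list shown in Figure 5.
PLACEHOLDER_REFUSAL_KEYWORDS: Sequence[str] = (
    "i cannot",
    "i can't",
    "i'm sorry",
    "i am unable",
)


def is_refusal(text: str,
               keywords: Sequence[str] = PLACEHOLDER_REFUSAL_KEYWORDS) -> bool:
    """Return True if the text contains any refusal keyword (case-insensitive)."""
    # Normalize case and curly apostrophes before matching.
    normalized = text.lower().replace("\u2019", "'")
    return any(keyword in normalized for keyword in keywords)


def refusal_rate(outputs: Iterable[str]) -> float:
    """Fraction of outputs flagged as refusals; 0.0 for an empty input."""
    flags = [is_refusal(output) for output in outputs]
    return sum(flags) / len(flags) if flags else 0.0


sample_outputs = [
    "I'm sorry, but I cannot help with that.",
    "Paris is the capital of France.",
]
print(f"refusal rate: {refusal_rate(sample_outputs):.2f}")
```

Substring matching is cheap and easy to audit, but it can miss paraphrased refusals and can flag benign text that happens to contain a keyword, so the rate it produces is best read as an approximation.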