How Post-Training Reshapes LLMs: A Mechanistic View on Knowledge, Truthfulness, Refusal, and Confidence
Hongzhe Du, Weikai Li, Min Cai, Karim Saraipour, Zimin Zhang, Himabindu Lakkaraju, Yizhou Sun, Shichang Zhang
TL;DR
This work provides a mechanistic, cross-model examination of post-training in LLMs along four axes: knowledge storage and representation, internal truthfulness beliefs, refusal behavior, and confidence. Using causal tracing, linear truthfulness and refusal directions, and cross-model patching across Llama-3.1-8B, Mistral-7B, and Llama-2-13B, it shows that knowledge storage locations and truthfulness directions are largely preserved after post-training, while refusal directions are reshaped and exhibit limited forward transfer; confidence differences are not explained by entropy neurons. The findings suggest practical paths for steering and for transferring knowledge edits from base to post-trained models, and point to the potential for transferring capabilities developed during post-training back to base models, informing future interpretability and post-training strategies. Code for reproducing the analyses is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.
Abstract
Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models. While many works have studied post-training algorithms and evaluated post-trained models by their outputs, how post-training reshapes LLMs internally remains understudied. In this paper, we compare base and post-trained LLMs mechanistically from four perspectives to better understand post-training effects. Our findings across model families and datasets reveal that: (1) Post-training does not change the factual knowledge storage locations, and it adapts knowledge representations from the base model while developing new knowledge representations; (2) Both truthfulness and refusal can be represented by vectors in the hidden representation space. The truthfulness direction is highly similar between the base and post-trained models, and it is effectively transferable for interventions; (3) The refusal direction differs between the base and post-trained models, and it shows limited forward transferability; (4) Differences in confidence between the base and post-trained models cannot be attributed to entropy neurons. Our study provides insights into the fundamental mechanisms preserved and altered during post-training, facilitates downstream tasks like model steering, and could potentially benefit future research in interpretability and LLM post-training. Our code is publicly available at https://github.com/HZD01/post-training-mechanistic-analysis.
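
To make finding (2) concrete, the sketch below illustrates what it means for a concept such as truthfulness to be "represented by a vector" and for that vector to be "highly similar" across two models. It is not the authors' implementation: it assumes a difference-in-means estimate of the direction (the abstract does not specify the estimator), and it replaces real hidden states with synthetic NumPy arrays. The function names, the hidden size of 64, and the steering strength are all hypothetical choices made for illustration.

```python
import numpy as np


def difference_in_means_direction(true_acts, false_acts):
    """Unit vector from the mean 'false' activation to the mean 'true' activation.

    Both inputs have shape (num_statements, hidden_dim).
    """
    direction = true_acts.mean(axis=0) - false_acts.mean(axis=0)
    return direction / np.linalg.norm(direction)


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def steer(hidden, direction, alpha):
    """Shift a hidden state along a unit direction by strength alpha."""
    return hidden + alpha * direction


# Synthetic stand-ins for hidden states (assumption: real activations would be
# collected from one layer of the base and post-trained models).
rng = np.random.default_rng(0)
hidden_dim = 64
shared_axis = np.zeros(hidden_dim)
shared_axis[0] = 1.0  # both "models" encode truthfulness along this axis


def synthetic_acts(num, sign, noise):
    """Activations clustered at sign * 2 * shared_axis with Gaussian noise."""
    return sign * 2.0 * shared_axis + noise * rng.normal(size=(num, hidden_dim))


base_dir = difference_in_means_direction(
    synthetic_acts(200, +1, 0.5), synthetic_acts(200, -1, 0.5)
)
post_dir = difference_in_means_direction(
    synthetic_acts(200, +1, 0.5), synthetic_acts(200, -1, 0.5)
)

# High cosine similarity is the synthetic analogue of a preserved direction.
similarity = cosine_similarity(base_dir, post_dir)

# "Transfer": apply the base-derived direction to a post-trained hidden state.
post_hidden = synthetic_acts(1, -1, 0.5)[0]
steered = steer(post_hidden, base_dir, alpha=3.0)
projection_gain = float(np.dot(steered, post_dir) - np.dot(post_hidden, post_dir))
```

In this toy setup the two directions are nearly parallel by construction, so steering with the base-derived vector also moves the state along the post-trained direction. A direction that was reshaped by post-training, as reported for refusal in finding (3), would correspond to a lower cosine similarity and a smaller projection gain.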
