Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking
Nikhil Prakash, Tamar Rott Shaham, Tal Haklay, Yonatan Belinkov, David Bau
TL;DR
This study interrogates how fine-tuning on generalized tasks reshapes internal mechanisms in large language models, using entity tracking as a case study. It combines path patching, Desiderata-based Component Masking (DCM), and Cross-Model Activation Patching (CMAP) to show that fine-tuning preserves a core circuit for entity tracking found in the base model while adding components that enhance performance. The authors demonstrate that the same four-group circuit identified in Llama-7B is present in fine-tuned variants and that improved performance stems from enhanced sub-mechanisms, particularly object-value encoding and augmented positional information. These findings support the view that fine-tuning augments, rather than rearchitects, existing mechanisms and provide a general methodology for dissecting mechanism-level changes across models. The work contributes to mechanistic interpretability by offering tools to quantify cross-model changes and to attribute performance gains to specific circuit subfunctions.
Abstract
Fine-tuning on generalized tasks such as instruction following, code generation, and mathematics has been shown to enhance language models' performance on a range of tasks. Nevertheless, explanations of how such fine-tuning influences the internal computations in these models remain elusive. We study how fine-tuning affects the internal mechanisms implemented in language models. As a case study, we explore the property of entity tracking, a crucial facet of language comprehension, where models fine-tuned on mathematics have substantial performance gains. We identify the mechanism that enables entity tracking and show that (i) primarily the same circuit implements entity tracking in both the original model and its fine-tuned versions. In fact, the entity tracking circuit of the original model, when evaluated on the fine-tuned versions, performs better than the full original model. (ii) The circuits of all the models implement roughly the same functionality: entity tracking is performed by tracking the position of the correct entity in both the original model and its fine-tuned versions. (iii) The performance boost in the fine-tuned models is primarily attributed to their improved ability to handle the augmented positional information. To uncover these findings, we employ path patching; DCM, which automatically detects model components responsible for specific semantics; and CMAP, a new approach for patching activations across models to reveal improved mechanisms. Our findings suggest that fine-tuning enhances, rather than fundamentally alters, the mechanistic operation of the model.
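The idea behind patching activations across models can be illustrated with a toy sketch. This is not the authors' implementation: the two "models" are hypothetical dictionaries of scalar components, the component names `position` and `value` are invented for illustration, and the output is simply the sum of component activations. The sketch copies one component's activation from the fine-tuned model into the base model and checks how much of the output gap that single swap closes.

```python
from typing import Callable, Dict, Optional, Tuple

# A toy "model": named components, each mapping the input to one activation.
# (Hypothetical stand-in for attention heads in a real transformer.)
Model = Dict[str, Callable[[float], float]]


def run(model: Model, x: float,
        patch: Optional[Dict[str, float]] = None) -> Tuple[float, Dict[str, float]]:
    """Run the toy model; components listed in `patch` use the given activation."""
    acts: Dict[str, float] = {}
    for name, fn in model.items():
        if patch is not None and name in patch:
            acts[name] = patch[name]   # overwrite with the patched activation
        else:
            acts[name] = fn(x)         # compute the activation normally
    return sum(acts.values()), acts


# Base and fine-tuned models share the same components; only "value" differs.
base: Model = {"position": lambda x: 1.0 * x, "value": lambda x: 0.5 * x}
tuned: Model = {"position": lambda x: 1.0 * x, "value": lambda x: 2.0 * x}

x = 2.0
base_out, _ = run(base, x)
tuned_out, tuned_acts = run(tuned, x)

# Cross-model patching: run the base model on the same input, replacing one
# component's activation with the one cached from the fine-tuned model.
patched_value, _ = run(base, x, patch={"value": tuned_acts["value"]})
patched_position, _ = run(base, x, patch={"position": tuned_acts["position"]})

print(base_out, tuned_out, patched_value, patched_position)
```

In this toy setting, patching `value` moves the base model's output all the way to the fine-tuned output, while patching `position` changes nothing, which localizes the difference between the two models to a single component. A real experiment would patch activations of components in an actual transformer and measure task accuracy rather than a raw scalar.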
