Fine-Tuning Enhances Existing Mechanisms: A Case Study on Entity Tracking

Nikhil Prakash; Tamar Rott Shaham; Tal Haklay; Yonatan Belinkov; David Bau

TL;DR

This study examines how fine-tuning on generalized tasks reshapes the internal mechanisms of large language models, using entity tracking as a case study. It combines path patching, Desiderata-based Component Masking (DCM), and Cross-Model Activation Patching (CMAP) to show that fine-tuning preserves the core entity tracking circuit found in the base model while adding components that enhance performance. The authors show that the four-group circuit identified in Llama-7B is also present in its fine-tuned variants, and that the performance gains stem from enhanced sub-mechanisms, particularly object-value encoding and the handling of augmented positional information. These findings support the view that fine-tuning augments, rather than rearchitects, existing mechanisms, and they provide a general methodology for dissecting mechanism-level changes across models. The work contributes to mechanistic interpretability by offering tools to quantify cross-model changes and to attribute performance gains to specific circuit sub-functions.

Abstract

Fine-tuning on generalized tasks such as instruction following, code generation, and mathematics has been shown to enhance language models' performance on a range of tasks. Nevertheless, explanations of how such fine-tuning influences the internal computations in these models remain elusive. We study how fine-tuning affects the internal mechanisms implemented in language models. As a case study, we explore the property of entity tracking, a crucial facet of language comprehension, where models fine-tuned on mathematics have substantial performance gains. We identify the mechanism that enables entity tracking and show that (i) primarily the same circuit implements entity tracking in both the original model and its fine-tuned versions. In fact, when evaluated on the fine-tuned versions, the entity tracking circuit of the original model performs better than the full original model. (ii) The circuits of all the models implement roughly the same functionality: entity tracking is performed by tracking the position of the correct entity in both the original model and its fine-tuned versions. (iii) The performance boost in the fine-tuned models is primarily attributed to their improved ability to handle the augmented positional information. To uncover these findings, we employ path patching; Desiderata-based Component Masking (DCM), which automatically detects model components responsible for specific semantics; and Cross-Model Activation Patching (CMAP), a new approach for patching activations across models to reveal improved mechanisms. Our findings suggest that fine-tuning enhances, rather than fundamentally alters, the mechanistic operation of the model.
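The idea behind patching activations across models can be illustrated with a toy sketch. This is not the authors' implementation: the one-layer "model" below is just a dictionary of named functions whose outputs are summed, and every name in it (`run_model`, `cross_model_patch`, `value_fetcher`, `other`, the scale factors) is a hypothetical stand-in. It assumes only what the abstract states, namely that both models share the same components, so an activation cached from the fine-tuned model can be substituted into the original model on the same input.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Head = Callable[[Vector], Vector]


def scale(factor: float) -> Head:
    """Build a toy 'head' that multiplies every input entry by `factor`."""
    return lambda x: [factor * v for v in x]


def run_model(heads: Dict[str, Head], x: Vector,
              patch: Optional[Dict[str, Vector]] = None,
              cache: Optional[Dict[str, Vector]] = None) -> Vector:
    """Sum the outputs of all heads on input x (a one-layer toy model).

    If `patch` holds a vector for a head name, that vector replaces the
    head's own output. If `cache` is given, each output used is stored in it.
    """
    total = [0.0] * len(x)
    for name, head in heads.items():
        out = patch[name] if patch and name in patch else head(x)
        if cache is not None:
            cache[name] = out
        total = [t + o for t, o in zip(total, out)]
    return total


def cross_model_patch(base: Dict[str, Head], tuned: Dict[str, Head],
                      x: Vector, names: List[str]) -> Vector:
    """Run `base` on x, replacing the named heads' outputs with `tuned`'s."""
    tuned_cache: Dict[str, Vector] = {}
    run_model(tuned, x, cache=tuned_cache)          # 1. cache fine-tuned activations
    patch = {n: tuned_cache[n] for n in names}      # 2. keep only the chosen heads
    return run_model(base, x, patch=patch)          # 3. rerun the base model, patched


# Hypothetical models: they differ only in the strength of one head.
base = {"value_fetcher": scale(0.5), "other": scale(1.0)}
tuned = {"value_fetcher": scale(2.0), "other": scale(1.0)}
x = [1.0, 2.0]

print(run_model(base, x))                                  # base model alone
print(cross_model_patch(base, tuned, x, ["value_fetcher"]))  # base with one head patched
print(run_model(tuned, x))                                 # fine-tuned model alone
```

In this toy setting, patching the single differing head makes the base model's output match the fine-tuned model's output, which is the kind of attribution the approach is meant to support: if patching one group of components closes the gap, the improvement can be attributed to that group. In a real transformer the patched quantities would be attention-head outputs at specific token positions rather than whole-vector scalings.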

Paper Structure (35 sections, 1 equation, 13 figures, 14 tables)

Introduction
Related Work
Experimental Setup
Is the same circuit present after fine-tuning?
Circuit Discovery in Llama-7B
Circuit Evaluation
Circuit Generalization across fine-tuned models
Is circuit functionality the same after fine-tuning?
Desiderata-based Component Masking
Circuit functionality in Llama-7B
Circuit functionality in fine-tuned models
Why do Goat-7B and FLoat-7B perform better?
Cross-model activation patching
Results
Discussion and Conclusion
...and 20 more sections

Figures (13)

Figure 1: Entity Tracking circuit in Llama-7B ($Cir$). The circuit is composed of 4 groups of heads (A,B,C,D) located at the last token (A,B), query label (C), and previous query label (D) token positions. Each group is illustrated by a prominent head in that group.
Figure 2: Desiderata for identifying circuit functionality. We define different desiderata (sets of an original sentence and a carefully designed counterfactual alteration of it with a known target output) to evaluate various hypotheses regarding the functionality of a subset of heads within the circuit. (a) When patching heads that encode information about the correct object, the object information from the alternate run (e.g. "cup") is implanted into the original run. (b) Heads sensitive to the query box label are affected by the alternate query label (e.g. "Box Y"), causing the output to be the object in that box. (c) Position-encoding heads represent positional information of the correct object token. Therefore, when these heads are patched, the model outputs the object at the location of the correct object token in the alternate sequence (e.g. "cross", which is located at the same position as "computer").
Figure 3: Circuit Functionality in Llama-7B, Vicuna-7B, Goat-7B, and FLoat-7B. We use DCM to uncover the functionality of each subgroup of the Llama-7B circuit. Group A (pink) is mainly sensitive to the value desideratum, while groups B and C (purple, turquoise) are responsible for positional information. We find group D insensitive to all three desiderata. Error bars indicate standard deviation.
Figure 4: Why do Goat-7B and FLoat-7B perform better? We use CMAP to patch activations of circuit components from Goat-7B and from FLoat-7B into Llama-7B, in order to attribute the performance improvement to a specific sub-mechanism used to perform entity tracking. We patch the output of the subset of heads in each group that are involved in the primary functionality. We find that patching the Value Fetcher heads alone can raise the performance of Llama-7B to that of Goat-7B and FLoat-7B. We also observe a significant performance boost when the output of the Position Transmitter heads is patched.
Figure A1: Critical Information Flow to Last Token Position
...and 8 more figures
