Table of Contents
Fetching ...

Tracking Universal Features Through Fine-Tuning and Model Merging

Niels Horn, Desmond Elliott

TL;DR

This exploration aims to provide deeper insights into the stability and transformation of features across typical transfer-learning scenarios using small-scale models and sparse auto-encoders.

Abstract

We study how features emerge, disappear, and persist across models fine-tuned on different domains of text. More specifically, we start from a base one-layer Transformer language model that is trained on a combination of the BabyLM corpus, and a collection of Python code from The Stack. This base model is adapted to two new domains of text: TinyStories, and the Lua programming language, respectively; and then these two models are merged using these two models using spherical linear interpolation. Our exploration aims to provide deeper insights into the stability and transformation of features across typical transfer-learning scenarios using small-scale models and sparse auto-encoders.

Tracking Universal Features Through Fine-Tuning and Model Merging

TL;DR

This exploration aims to provide deeper insights into the stability and transformation of features across typical transfer-learning scenarios using small-scale models and sparse auto-encoders.

Abstract

We study how features emerge, disappear, and persist across models fine-tuned on different domains of text. More specifically, we start from a base one-layer Transformer language model that is trained on a combination of the BabyLM corpus, and a collection of Python code from The Stack. This base model is adapted to two new domains of text: TinyStories, and the Lua programming language, respectively; and then these two models are merged using these two models using spherical linear interpolation. Our exploration aims to provide deeper insights into the stability and transformation of features across typical transfer-learning scenarios using small-scale models and sparse auto-encoders.

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures.

Figures (6)

  • Figure 1: Overview of the experimental design. We start with a base model trained on BabyLM and Python code (1), which is fine-tuned (FT) on two new domains: the Lua programming language (2), and TinyStories (3). The fine-tuned models are merged into a single LuaStories model using spherical linear interpolation (SLERP) interpolation (4). For each of these models, we train a sparse auto-encoder on the MLP activations using the same data distribution as the original model.
  • Figure 2: Overview of the features persisting through fine-tuning and model merging, showing volumes and trajectories of extracted features that emerge, persist and disappear. This overview omits the features that don't persist, and so the visual flows are scaled proportional to the persisting features. We note the share of features that persist from each model.
  • Figure 3: Visualisation of the feature activation patterns of the universally extracted variable assignment features found in each model. Each token is highlighted according to the feature's activation level, where darker background colour denotes higher level of activation. Additionally, we note the observed activation pattern correlations between each feature.
  • Figure 4: Examples of observed activation patterns of the BabyPython Python exception feature, and the closest matching feature in the Lua model, qualitatively showing insufficient correlation between the two.
  • Figure 5: Observed accuracy trends of a merged model consisting of weights spherically linear interpolated between the Lua model and the TinyStories model, as measured on the validation datasets of the Lua and TinyStories, respectively, at each interpolation step (slerp $t$-value). The dashed baselines show the accuracy of the shared base model underlying the Lua and TinyStories model, on the same validation datasets.
  • ...and 1 more figures