Evaluating Synthetic Activations composed of SAE Latents in GPT-2

Giorgi Giglemiani, Nora Petrova, Chatrik Singh Mangat, Jett Janiak, Stefan Heimersheim

TL;DR

Synthetic activations closely resemble real activations when the authors control for the sparsity and cosine similarity of the constituent SAE latents. This suggests that real activations cannot be explained by a simple "bag of SAE latents" lacking internal structure, and that SAE latents instead possess significant geometric and statistical properties.

Abstract

Sparse Auto-Encoders (SAEs) are commonly employed in mechanistic interpretability to decompose the residual stream into monosemantic SAE latents. Recent work demonstrates that perturbing a model's activations at an early layer results in a step-function-like change in the model's final layer activations. Furthermore, the model's sensitivity to this perturbation differs between model-generated (real) activations and random activations. In our study, we assess model sensitivity in order to compare real activations to synthetic activations composed of SAE latents. Our findings indicate that synthetic activations closely resemble real activations when we control for the sparsity and cosine similarity of the constituent SAE latents. This suggests that real activations cannot be explained by a simple "bag of SAE latents" lacking internal structure, and instead suggests that SAE latents possess significant geometric and statistical properties. Notably, we observe that our synthetic activations exhibit less pronounced activation plateaus compared to those typically surrounding real activations.
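
To make the "bag of SAE latents" idea concrete, the following is a minimal sketch of an unstructured synthetic activation: a sparse sum of randomly chosen decoder directions. It is an illustration under stated assumptions, not the authors' construction. The decoder matrix is random stand-in data, and the function names, the exponential coefficient distribution, and the parameters `n_active` and `coeff_scale` are all hypothetical. The second helper measures the mean pairwise cosine similarity of the chosen directions, which is one way the geometric property mentioned above could be quantified.

```python
import numpy as np


def bag_of_latents_activation(decoder, n_active, rng, coeff_scale=1.0):
    """Sum `n_active` randomly chosen decoder directions (hypothetical baseline).

    decoder: (n_latents, d_model) array of unit-norm decoder directions.
    Returns the synthetic activation and the indices of the active latents.
    """
    n_latents = decoder.shape[0]
    # Pick active latents uniformly at random, ignoring any structure between them.
    idx = rng.choice(n_latents, size=n_active, replace=False)
    # Assumed coefficient distribution: positive, exponentially distributed.
    coeffs = rng.exponential(scale=coeff_scale, size=n_active)
    return coeffs @ decoder[idx], idx


def mean_pairwise_cosine(directions):
    """Mean cosine similarity over all distinct pairs of row vectors."""
    unit = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    sims = unit @ unit.T
    off_diagonal = sims[~np.eye(len(unit), dtype=bool)]
    return float(off_diagonal.mean())


rng = np.random.default_rng(0)
# Stand-in decoder: random unit vectors, NOT a trained SAE.
decoder = rng.normal(size=(512, 64))
decoder /= np.linalg.norm(decoder, axis=1, keepdims=True)

synthetic, active_idx = bag_of_latents_activation(decoder, n_active=20, rng=rng)
pairwise_cos = mean_pairwise_cosine(decoder[active_idx])
```

A structured variant would constrain how `idx` and `coeffs` are chosen so that the number of active latents and their pairwise cosine similarities match statistics measured on real activations; how the paper does this is not specified in the abstract.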

Paper Structure (21 sections, 3 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: The L2 distance after Layer 11 (left) and the KL divergence of the next-token prediction probabilities (right) between the perturbed and unperturbed model, as three base activations at Layer 1 are slowly perturbed towards model-generated activations (orange) and random points sampled from the distribution of model activations (blue). The x-axis represents the total length of the perturbation broken into $100$ steps of size $0.5$ each. The dot on each solid line represents the maximum slope (MS) step for each perturbation. The dashed lines represent the average L2 distance and KL divergence per step for $1000$ perturbations of both types. The linear part at the start of the curves represents the activation plateau, and the sharp rise in the curves represents the blowup.
  • Figure 2: The distributions of the max slope (MS) steps for perturbations towards model-generated (orange), random (blue), synthetic-baseline (purple), and synthetic-structured (green) activations. The left panel shows the counts of MS steps occurring in different bins along the length of the perturbation, and the right panel shows the corresponding cumulative frequency. We find that perturbing towards synthetic-structured activations resembles perturbing towards model-generated activations more closely than perturbing towards synthetic-baseline activations does.
  • Figure 3: The distributions of the activation plateau (AP) steps for perturbations starting at model-generated, random, synthetic-baseline, and synthetic-structured activations. We perturb towards random activations in all cases. The left panel shows the counts of AP steps occurring in different bins along the length of the perturbation, and the right panel shows the corresponding cumulative frequency. We find that model-generated activations (orange) have flatter plateaus around them than all of the other activation types. We also see that synthetic-baseline activations (purple) have the steepest plateaus around them, while plateaus around synthetic-structured (green) and random (blue) activations look similar.
  • Figure A.1: The distributions of the max slope (MS) steps for perturbations with relative step size towards model-generated (orange), random (blue), synthetic-baseline (purple), and synthetic-structured (green) activations. The left panel shows the counts of MS steps occurring in different bins along the length of the perturbation, and the right panel shows the corresponding cumulative frequency. We find that, in the relative step size setup, perturbing towards synthetic-structured activations is slightly more similar to perturbing towards model-generated activations than perturbing towards synthetic-baseline activations is.
  • Figure B.1: The distributions of the AUC steps for perturbations with absolute step size (top) and relative step size (bottom) towards model-generated (orange), random (blue), synthetic-baseline (purple), and synthetic-structured (green) activations. The left column shows the counts of AUC steps occurring in different bins along the length of the perturbation, and the right column shows the corresponding cumulative frequency. We find that our results for the AUC step distributions are similar to those for the MS step distributions.
  • ...and 4 more figures
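
The quantities named in the Figure 1 caption can be sketched in a few lines. The sketch below builds a straight-line perturbation path with $100$ steps of size $0.5$, defines the KL divergence between two next-token distributions, and locates the max slope (MS) step as the step with the largest increase in a response curve. It rests on assumptions: the helper names are hypothetical, the path is taken to move in fixed increments along the direction from the base activation to the target, the MS step is taken to be the largest discrete difference, and a logistic curve stands in for the L2 distance that the real model would produce after Layer 11.

```python
import numpy as np


def perturbation_path(base, target, n_steps=100, step_size=0.5):
    """Points along the line from `base` towards `target` in fixed-size steps."""
    direction = (target - base) / np.linalg.norm(target - base)
    lengths = step_size * np.arange(n_steps + 1)  # 0.0, 0.5, ..., 50.0
    return base + lengths[:, None] * direction


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete probability vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))


def max_slope_step(curve):
    """Index of the step at which the curve rises the most (assumed MS definition)."""
    increases = np.diff(curve)
    return int(np.argmax(increases)) + 1


rng = np.random.default_rng(0)
base = rng.normal(size=8)      # stand-in for a Layer 1 activation
target = rng.normal(size=8)    # stand-in for the activation perturbed towards
path = perturbation_path(base, target)

# Toy response: flat at first (plateau), then a sharp rise (blowup).
steps = np.arange(path.shape[0])
toy_l2_distance = 10.0 / (1.0 + np.exp(-(steps - 40.5) / 2.0))
ms_step = max_slope_step(toy_l2_distance)
```

In an actual experiment, each point of `path` would be patched into the model at Layer 1, and `toy_l2_distance` would be replaced by the measured L2 distance or by `kl_divergence` applied to the perturbed and unperturbed next-token probabilities.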