Investigating Sensitive Directions in GPT-2: An Improved Baseline and Comparative Analysis of SAEs

Daniel J. Lee, Stefan Heimersheim

TL;DR

We show that the KL divergence induced by Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high when compared to an improved baseline, and that end-to-end SAE features do not have stronger effects on model outputs than traditional SAE features.

Abstract

Sensitive directions experiments attempt to understand the computational features of Language Models (LMs) by measuring how much the next-token prediction probabilities change when activations are perturbed along specific directions. We extend the sensitive directions work by introducing an improved baseline for perturbation directions. We demonstrate that the KL divergence for Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high compared to the improved baseline. We also show that feature directions uncovered by SAEs have varying impacts on model outputs depending on the SAE's sparsity, with lower L0 SAE feature directions exerting a greater influence. Additionally, we find that end-to-end SAE features do not exhibit stronger effects on model outputs compared to traditional SAEs.
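
The measurement described above can be illustrated with a minimal sketch. It is not the authors' code: the random matrix `W_out` is a hypothetical stand-in for everything downstream of the perturbed layer in a real LM, the direction is an isotropic random vector, and the names `perturb` and `sensitivity_curve` are introduced here purely for illustration. The sketch moves an activation a fixed length along a unit direction and reports the KL divergence between the original and perturbed next-token distributions.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def kl_divergence(logits_p, logits_q):
    # KL(P || Q) between the distributions defined by two logit vectors.
    log_p = log_softmax(logits_p)
    log_q = log_softmax(logits_q)
    return float(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1))

def perturb(activation, direction, length):
    # Move the activation by `length` along the normalized direction.
    unit = direction / np.linalg.norm(direction)
    return activation + length * unit

rng = np.random.default_rng(0)
d_model, vocab = 16, 50
# Assumption: a random linear map replaces the rest of the model.
W_out = rng.normal(size=(d_model, vocab))
activation = rng.normal(size=d_model)
direction = rng.normal(size=d_model)  # isotropic random direction

def sensitivity_curve(activation, direction, lengths):
    # KL divergence of the output distribution at each perturbation length.
    base_logits = activation @ W_out
    return [
        kl_divergence(base_logits, perturb(activation, direction, a) @ W_out)
        for a in lengths
    ]

curve = sensitivity_curve(activation, direction, [0.0, 0.5, 1.0, 2.0])
```

In an actual experiment, the linear map would be replaced by a forward pass through the remaining layers of the model, and the direction would come from one of the sources compared in the paper, such as an SAE feature, an SAE reconstruction error, or a baseline direction.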

Paper Structure

This paper contains 18 sections, 3 equations, 7 figures.

Figures (7)

  • Figure 1: We vary the perturbation length for perturbations in Layer 6 resid_pre. (a) We compare the difference versus mixture perturbations. For both cov-random (left) and real (right) cases, the difference perturbations have a greater change in model output than mixture perturbations. (b) We compare the cov-random versus real baselines for difference (left) and mixture (right) types.
  • Figure 2: Comparison of the average KL divergence of four different substitution types. On the x-axis we have different GPT2-small layers. SAE from Bloom2024-bi was used.
  • Figure 3: We vary the perturbation length for perturbations in Layer 6 resid_pre. Each column shows a different SAE model type. We compare the SAE reconstruction error directions with cov-random mixture and isotropic random directions. We color the lines by the L0 values of the SAEs. (b) is the same as (a), but with a reduced x-axis limit.
  • Figure 4: We vary the perturbation length for SAE feature directions in Layer 6 resid_pre. The three columns compare the three different SAE model types. We color the lines by the L0 values of the SAEs.
  • Figure 5: Comparison of the change in model output for various perturbation lengths for different SAE feature directions and baselines in Layer 6 resid_pre.
  • ...and 2 more figures