Ensembling Sparse Autoencoders

Soham Gadgil, Chris Lin, Su-In Lee

TL;DR

Empirically, ensembling SAEs improves the reconstruction of language model activations, the diversity of learned features, and SAE stability. Ensembles also outperform a single SAE on downstream tasks such as concept detection and spurious correlation removal, which indicates improved practical utility.

Abstract

Sparse autoencoders (SAEs) are used to decompose neural network activations into human-interpretable features. Typically, features learned by a single SAE are used for downstream applications. However, it has recently been shown that SAEs trained with different initial weights can learn different features, demonstrating that a single SAE captures only a limited subset of features that can be extracted from the activation space. Motivated by this limitation, we propose to ensemble multiple SAEs through naive bagging and boosting. Specifically, SAEs trained with different weight initializations are ensembled in naive bagging, whereas SAEs sequentially trained to minimize the residual error are ensembled in boosting. We evaluate our ensemble approaches with three settings of language models and SAE architectures. Our empirical results demonstrate that ensembling SAEs can improve the reconstruction of language model activations, diversity of features, and SAE stability. Furthermore, ensembling SAEs performs better than applying a single SAE on downstream tasks such as concept detection and spurious correlation removal, showing improved practical utility.
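The two ensembling strategies can be written compactly. The following is a sketch under assumed notation, not the paper's own equations: $\mathbf{a}$ denotes an activation vector and $\mathbf{r}^{(j)}$ a residual (both symbols are introduced here for illustration), while $g(\cdot; \mathbf{\theta}^{(j)})$ denotes the $j$-th of $J$ SAEs, as in Proposition 1. In naive bagging the ensembled reconstruction is the average of the individual reconstructions; in boosting each SAE reconstructs the residual left by its predecessors and the reconstructions are summed.

$$
\hat{\mathbf{a}}_{\text{bag}} = \frac{1}{J} \sum_{j=1}^{J} g(\mathbf{a}; \mathbf{\theta}^{(j)})
$$

$$
\mathbf{r}^{(0)} = \mathbf{a}, \qquad \mathbf{r}^{(j)} = \mathbf{r}^{(j-1)} - g(\mathbf{r}^{(j-1)}; \mathbf{\theta}^{(j)}), \qquad \hat{\mathbf{a}}_{\text{boost}} = \sum_{j=1}^{J} g(\mathbf{r}^{(j-1)}; \mathbf{\theta}^{(j)})
$$

Under this formulation the boosting reconstruction error telescopes: $\mathbf{a} - \hat{\mathbf{a}}_{\text{boost}} = \mathbf{r}^{(J)}$, so each added SAE only has to explain what the earlier ones missed.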

Paper Structure

This paper contains 24 sections, 5 theorems, 35 equations, 2 figures, 3 tables, 1 algorithm.

Key Result

Proposition 1

Suppose there are $J$ SAEs $g(\cdot; {\mathbf{\theta}}^{(1)}), ..., g(\cdot; {\mathbf{\theta}}^{(J)})$, with decoder matrices ${{\mathbf{W}}_{\text{dec}}}^{(1)}, ..., {{\mathbf{W}}_{\text{dec}}}^{(J)} \in {\mathbb{R}}^{d \times k}$ and decoder biases ${{\mathbf{b}}_{\text{dec}}}^{(1)}, ..., {{\mathb where and ${\mathbf{f}}_{i'} = {{\mathbf{W}}_{\text{dec}}}[:, i']$, with ${\mathbf{c}} \in {\mathb

Figures (2)

  • Figure 1: Overview of the proposed SAE ensembling strategies. a. Naive bagging involves multiple SAEs with different weight initializations, which can be trained in parallel. The ensembled reconstruction is the average of the reconstructions obtained from each individual SAE. b. Boosting involves sequential training of SAEs on the residual error left by the previous iterations. The ensembled reconstruction is the sum of the reconstructions of the individual SAEs. c. For both approaches, ensembling the features and feature coefficients involves a concatenation.
  • Figure 2: Effect of the number of SAEs in the ensemble for naive bagging and boosting on the intrinsic evaluation metrics for Gemma 2-2B. The shaded regions indicate 95% confidence intervals across 5 different experiment runs. For naive bagging, the different experiment runs correspond to different sets of initial weights.
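
The combination rules described in the Figure 1 caption can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: `ToySAE` is a hypothetical ReLU SAE with random, untrained weights, and the function names are assumptions made for this example. Training is omitted entirely; in the paper the boosted SAEs are trained sequentially on residuals, whereas here only the way already-available SAEs are combined is shown.

```python
import numpy as np


class ToySAE:
    """Hypothetical ReLU SAE with random (untrained) weights, for illustration only."""

    def __init__(self, d, k, seed):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(k, d))
        self.b_enc = np.zeros(k)
        self.W_dec = rng.normal(scale=0.1, size=(d, k))  # columns are features
        self.b_dec = rng.normal(scale=0.1, size=d)

    def encode(self, a):
        # Non-negative feature coefficients.
        return np.maximum(self.W_enc @ a + self.b_enc, 0.0)

    def decode(self, c):
        return self.W_dec @ c + self.b_dec


def bagging_ensemble(saes, a):
    """Average the reconstructions; concatenate features and coefficients."""
    coeffs = [s.encode(a) for s in saes]
    recon = np.mean([s.decode(c) for s, c in zip(saes, coeffs)], axis=0)
    features = np.concatenate([s.W_dec for s in saes], axis=1)  # d x (J*k)
    return recon, features, np.concatenate(coeffs)


def boosting_ensemble(saes, a):
    """Each SAE reconstructs the residual left by the previous ones; sum the results."""
    residual = a.copy()
    recon = np.zeros_like(a)
    coeffs = []
    for s in saes:
        c = s.encode(residual)
        r_hat = s.decode(c)
        coeffs.append(c)
        recon = recon + r_hat
        residual = residual - r_hat
    features = np.concatenate([s.W_dec for s in saes], axis=1)  # d x (J*k)
    return recon, features, np.concatenate(coeffs)


if __name__ == "__main__":
    d, k, J = 8, 16, 3
    saes = [ToySAE(d, k, seed=j) for j in range(J)]  # different initializations
    a = np.random.default_rng(0).normal(size=d)
    for name, fn in [("bagging", bagging_ensemble), ("boosting", boosting_ensemble)]:
        recon, features, coeffs = fn(saes, a)
        print(name, recon.shape, features.shape, coeffs.shape)
```

One consequence of the concatenation is that, for these toy SAEs, the bagged reconstruction equals the concatenated feature matrix applied to the coefficients scaled by 1/J plus the mean decoder bias, and the boosted reconstruction equals the concatenated feature matrix applied to the coefficients plus the sum of the decoder biases. In that sense either ensemble can be read as one larger SAE with J times as many features.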

Theorems & Definitions (15)

  • Proposition 1
  • Remark 1
  • Proposition 1
  • proof
  • Lemma 1
  • proof
  • Proposition 2
  • proof
  • Remark 2
  • Remark 3
  • ...and 5 more