SynthSAEBench: Evaluating Sparse Autoencoders on Scalable Realistic Synthetic Data

David Chanin, Adrià Garriga-Alonso

TL;DR

SynthSAEBench is a toolkit for generating large-scale synthetic data with realistic feature characteristics, including correlation, hierarchy, and superposition. It ships with a standardized benchmark model, SynthSAEBench-16k, that enables direct comparison of SAE architectures, and the authors use it to identify a new failure mode in Matching Pursuit SAEs.

Abstract

Improving Sparse Autoencoders (SAEs) requires benchmarks that can precisely validate architectural innovations. However, current SAE benchmarks on LLMs are often too noisy to differentiate architectural improvements, and current synthetic data experiments are too small-scale and unrealistic to provide meaningful comparisons. We introduce SynthSAEBench, a toolkit for generating large-scale synthetic data with realistic feature characteristics including correlation, hierarchy, and superposition, and a standardized benchmark model, SynthSAEBench-16k, enabling direct comparison of SAE architectures. Our benchmark reproduces several previously observed LLM SAE phenomena, including the disconnect between reconstruction and latent quality metrics, poor SAE probing results, and a precision-recall trade-off mediated by L0. We further use our benchmark to identify a new failure mode: Matching Pursuit SAEs exploit superposition noise to improve reconstruction without learning ground-truth features, suggesting that more expressive encoders can easily overfit. SynthSAEBench complements LLM benchmarks by providing ground-truth features and controlled ablations, enabling researchers to precisely diagnose SAE failure modes and validate architectural improvements before scaling to LLMs.
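
To make the kind of data described above concrete, here is a minimal sketch of a synthetic activation generator with three of the named properties: superposition (more features than hidden dimensions), Zipfian firing probabilities, and a one-level hierarchy in which a child feature can only fire when its parent fires. This is not the SynthSAEBench implementation. The function name `make_synthetic_activations`, all parameter values, the `0.5 / rank` probability schedule, and the parent assignment are assumptions chosen for illustration, and feature correlation is omitted for brevity.

```python
import numpy as np


def make_synthetic_activations(num_samples, num_features=64, hidden_dim=16, seed=0):
    """Illustrative only: sample activations as sparse sums of feature directions."""
    rng = np.random.default_rng(seed)

    # Zipfian firing probabilities (assumed schedule): p_i proportional to 1 / rank.
    ranks = np.arange(1, num_features + 1)
    firing_probs = 0.5 / ranks

    # Superposition: num_features > hidden_dim, so the unit-norm feature
    # directions cannot all be orthogonal and will interfere with each other.
    directions = rng.standard_normal((num_features, hidden_dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)

    # Hierarchy (assumed layout): the second half of the features are children
    # of the first half; -1 marks a feature with no parent.
    half = num_features // 2
    parents = np.full(num_features, -1)
    parents[half:] = np.arange(num_features - half)

    # Sample which features fire, then silence children whose parent is off.
    fires = rng.random((num_samples, num_features)) < firing_probs
    has_parent = parents >= 0
    fires[:, has_parent] &= fires[:, parents[has_parent]]

    # Firing magnitudes near 1, then project into the hidden space.
    magnitudes = np.abs(1.0 + 0.1 * rng.standard_normal((num_samples, num_features)))
    coeffs = fires * magnitudes
    activations = coeffs @ directions
    return activations, fires, directions, parents


acts, fires, directions, parents = make_synthetic_activations(1000)
print(acts.shape)               # (num_samples, hidden_dim)
print(fires.sum(axis=1).mean()) # average number of active features per sample
```

Because the ground-truth `directions` and `fires` are returned alongside the activations, an SAE trained on `acts` can be scored directly against them, which is the property the abstract highlights as missing from LLM-based benchmarks.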

Paper Structure

This paper contains 74 sections, 30 equations, 18 figures, 1 table.

Figures (18)

  • Figure 1: SynthSAEBench provides a large-scale synthetic data model with realistic feature characteristics, including correlation, hierarchy, superposition, and Zipfian firing distributions, scalable to hundreds of thousands of features and realistic hidden dimension sizes.
  • Figure 2: Overview of the process used to generate a single training activation, $a$.
  • Figure 3: SynthSAEBench-16k feature firing probabilities.
  • Figure 4: SynthSAEBench-16k hierarchy distribution.
  • Figure 5: Variance explained (left), MCC (middle), and F1-score (right) for SAEs trained on SynthSAEBench-16k across varying L0 values. Shaded areas show the standard deviation across 5 seeds (too small to be visible for most SAEs).
  • ...and 13 more figures