DB-KSVD: Scalable Alternating Optimization for Disentangling High-Dimensional Embedding Spaces

Romeo Valentin, Sydney M. Katz, Vincent Vanhoucke, Mykel J. Kochenderfer

TL;DR

The paper tackles disentangling high-dimensional transformer embeddings using scalable dictionary learning. It introduces DB-KSVD, an alternating-optimization adaptation of KSVD that handles millions of samples and thousands of dimensions, and augments it with Matryoshka structuring to impose an inductive bias. Empirically, DB-KSVD achieves SAEBench performance competitive with sparse autoencoders (SAEs) and demonstrates substantial speedups and scalability, while an analysis of dictionary coherence links interpretability gains to structural priors. The work provides practical implementation strategies and suggests that traditional optimization can be scaled effectively to mechanistic interpretability tasks on large transformer embeddings, offering a complementary direction to SAE-based approaches.

Abstract

Dictionary learning has recently emerged as a promising approach for mechanistic interpretability of large transformer models. Disentangling high-dimensional transformer embeddings, however, requires algorithms that scale to high-dimensional data with large sample sizes. Recent work has explored sparse autoencoders (SAEs) for this problem. However, SAEs use a simple linear encoder to solve the sparse encoding subproblem, which is known to be NP-hard. It is therefore interesting to understand whether this structure is sufficient to find good solutions to the dictionary learning problem or if a more sophisticated algorithm could find better solutions. In this work, we propose Double-Batch KSVD (DB-KSVD), a scalable dictionary learning algorithm that adapts the classic KSVD algorithm. DB-KSVD is informed by the rich theoretical foundations of KSVD but scales to datasets with millions of samples and thousands of dimensions. We demonstrate the efficacy of DB-KSVD by disentangling embeddings of the Gemma-2-2B model and evaluating on six metrics from the SAEBench benchmark, where we achieve competitive results when compared to established approaches based on SAEs. By matching SAE performance with an entirely different optimization approach, our results suggest that (i) SAEs do find strong solutions to the dictionary learning problem and (ii) that traditional optimization approaches can be scaled to the required problem sizes, offering a promising avenue for further research. We provide an implementation of DB-KSVD at https://github.com/RomeoV/KSVD.jl.
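
To make the alternating structure concrete, the following is a minimal sketch of one iteration of the classic KSVD algorithm that DB-KSVD adapts. It is not the double-batch variant and is not taken from KSVD.jl. The function names `omp` and `ksvd_step`, the use of greedy orthogonal matching pursuit for the sparse encoding step, and the toy problem sizes are all assumptions made for illustration.

```python
import numpy as np

def omp(D, y, k):
    """Greedy orthogonal matching pursuit: approximate y with at most k atoms of D."""
    residual = y.copy()
    support = []
    coef = np.zeros(0)
    for _ in range(k):
        # Select the atom most correlated with the current residual.
        j = int(np.argmax(np.abs(D.T @ residual)))
        if j in support:
            break
        support.append(j)
        # Re-fit all selected coefficients jointly by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coef
    x = np.zeros(D.shape[1])
    x[support] = coef
    return x

def ksvd_step(Y, D, k):
    """One alternating iteration of classic KSVD (illustrative, not DB-KSVD)."""
    # Step 1, sparse encoding: encode every sample (column of Y) with at most k atoms.
    X = np.stack([omp(D, Y[:, i], k) for i in range(Y.shape[1])], axis=1)
    # Step 2, dictionary update: refine one atom at a time.
    D = D.copy()
    for j in range(D.shape[1]):
        users = np.flatnonzero(X[j])
        if users.size == 0:
            continue  # unused atom; left unchanged in this sketch
        # Residual of the samples that use atom j, with atom j's contribution added back.
        E = Y[:, users] - D @ X[:, users] + np.outer(D[:, j], X[j, users])
        # The best rank-1 approximation of E yields the new atom and its coefficients.
        U, S, Vt = np.linalg.svd(E, full_matrices=False)
        D[:, j] = U[:, 0]
        X[j, users] = S[0] * Vt[0]
    return D, X

# Toy usage on synthetic data (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
d, m, n, k = 16, 32, 200, 3
Y = rng.standard_normal((d, n))
D = rng.standard_normal((d, m))
D /= np.linalg.norm(D, axis=0)  # unit-norm atoms
for _ in range(5):
    D, X = ksvd_step(Y, D, k)
print("reconstruction error:", np.linalg.norm(Y - D @ X))
```

Within a single step, each atom update can only lower the reconstruction error for the fixed sparsity pattern, because the rank-1 SVD is optimal for the restricted residual. The greedy encoding step carries no such guarantee, which is one reason the encoding subproblem is the harder half of the alternation.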

Paper Structure

This paper contains 29 sections, 7 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Results (higher is better) of our DB-KSVD algorithm and Matryoshka adaptation on the SAEBench benchmark with 4096 dictionary elements.
  • Figure 2: Histograms of element-wise coherence metrics for different sparsities. The dictionaries are constructed using the Gemma-2-2B embeddings and comprise $m=4096$ dictionary elements.
  • Figure 3: Convergence of the DB-KSVD algorithm with varying fractions of known adjacencies $\gamma$. We plot the mean relative error at each iteration, defined as $\frac{1}{n}\sum_i \|y_i - Dx_i\|_2/\|y_i\|_2$.
  • Figure 4: Results (higher is better) of our DB-KSVD algorithm and Matryoshka adaptation on the SAEBench benchmark with 16384 dictionary elements.
  • Figure 5: Histograms of element-wise coherence metrics for different sparsities. The dictionaries are constructed using the Gemma-2-2B embeddings and comprise $m=16384$ dictionary elements.
  • ...and 2 more figures
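
The mean relative error from the Figure 3 caption, $\frac{1}{n}\sum_i \|y_i - Dx_i\|_2/\|y_i\|_2$, can be computed as in the sketch below. The function name `mean_relative_error` and the column-wise data layout (samples $y_i$ as columns of `Y`, codes $x_i$ as columns of `X`) are assumptions for illustration, not the layout used by the authors' implementation.

```python
import numpy as np

def mean_relative_error(Y, D, X):
    """Mean over samples of ||y_i - D x_i||_2 / ||y_i||_2.

    Y: (d, n) samples as columns; D: (d, m) dictionary; X: (m, n) sparse codes.
    Assumes no sample in Y is the zero vector.
    """
    residual_norms = np.linalg.norm(Y - D @ X, axis=0)
    sample_norms = np.linalg.norm(Y, axis=0)
    return float(np.mean(residual_norms / sample_norms))

# Toy usage with random data (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
D = rng.standard_normal((8, 20))
X = rng.standard_normal((20, 5))
Y = D @ X  # perfectly reconstructable samples
print(mean_relative_error(Y, D, X))
```

Because each residual is normalized by its own sample norm, the metric is insensitive to the overall scale of the embeddings: all-zero codes give a value of exactly 1, and a perfect reconstruction gives 0.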