Dictionary Learning: The Complexity of Learning Sparse Superposed Features with Feedback

Akash Kumar

TL;DR

The paper studies how to recover latent feature dictionaries encoded in a matrix $\boldsymbol{\sf D}$ (with atoms $u_i$) by querying an agent with relative triplet feedback to learn the feature matrix $\boldsymbol{\Phi}=\boldsymbol{\sf D}\boldsymbol{\sf D}^\top$. It develops a theory of feedback complexity under constructive and sampled settings, deriving tight bounds that depend on the rank $r$ of $\boldsymbol{\Phi}$ and the activation sparsity, including reductions to pairwise equalities and low-rank decompositions. Key results show worst-case $\Omega(p^2)$ lower bounds for full-rank cases, improved bounds for low-rank matrices, and upper bounds under 2-sparse constructive feedback; in sampling regimes, $\Theta(p(p+1)/2)$ feedbacks suffice almost surely with Lebesgue-distributed activations, with sparsity-modulated bounds. Experimentally, the framework is validated on Recursive Feature Machines and sparse autoencoder dictionaries from large language models, demonstrating efficient feature recovery under appropriate structure and addressing memory and computation challenges with low-rank factorization. Overall, the work clarifies when efficient feature retrieval is possible from minimal, semantically meaningful feedback and highlights the importance of low-rank structure for practical scalability in neural-feature interpretability.
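The reduction to pairwise equalities mentioned above can be illustrated with a small numerical sketch. It assumes, hypothetically, that each feedback is a pair of activations $(y, z)$ for which the agent certifies $y^\top \boldsymbol{\Phi} y = z^\top \boldsymbol{\Phi} z$; the paper's exact feedback form may differ. Each such pair gives one linear constraint $\langle \boldsymbol{\Phi}, yy^\top - zz^\top \rangle = 0$, so $p(p+1)/2$ generic pairs should pin down $\boldsymbol{\Phi}$ up to scaling. The names `equality_pair` and `unit` and the choice $p = 4$ are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 4
m = p * (p + 1) // 2  # degrees of freedom of a symmetric p x p matrix

# Hypothetical ground-truth feature matrix Phi = D D^T (full rank here).
D = rng.standard_normal((p, p))
Phi = D @ D.T

iu = np.triu_indices(p)
# Off-diagonal entries get weight 2 so that row @ Phi[iu] == <Phi, S>
# for any symmetric matrix S.
weights = np.where(iu[0] == iu[1], 1.0, 2.0)


def equality_pair(Phi, rng):
    """Assumed agent: returns (y, z) with y^T Phi y == z^T Phi z."""
    y = rng.standard_normal(Phi.shape[0])
    z = rng.standard_normal(Phi.shape[0])
    z *= np.sqrt((y @ Phi @ y) / (z @ Phi @ z))  # rescale z to match y
    return y, z


# One linear constraint <Phi, y y^T - z z^T> = 0 per feedback pair.
rows = []
for _ in range(m):
    y, z = equality_pair(Phi, rng)
    S = np.outer(y, y) - np.outer(z, z)
    rows.append(weights * S[iu])
A = np.array(rows)

# The learner takes the (generically one-dimensional) null space of A.
_, _, Vt = np.linalg.svd(A)
Phi_hat = np.zeros((p, p))
Phi_hat[iu] = Vt[-1]
Phi_hat = Phi_hat + Phi_hat.T - np.diag(np.diag(Phi_hat))
if np.trace(Phi_hat) < 0:  # fix the sign ambiguity of the null vector
    Phi_hat = -Phi_hat


def unit(M):
    """Scale a matrix to unit Frobenius norm."""
    return M / np.linalg.norm(M)


rel_err = np.linalg.norm(unit(Phi_hat) - unit(Phi))
print(f"recovery error up to scaling: {rel_err:.2e}")
```

The learner here never sees $\boldsymbol{\Phi}$ directly, only the constraint rows, and it can recover the matrix only up to a positive scalar, which is why both matrices are normalized before comparison.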

Abstract

The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent's feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machines and dictionary extraction from sparse autoencoders trained on Large Language Models.

Paper Structure

This paper contains 41 sections, 16 theorems, 112 equations, 11 figures, 1 table, 3 algorithms.

Key Result

Lemma 1

Assume $\boldsymbol{\Phi} \in \textsf{Sym}_{+}(\mathbb{R}^{p\times p})$. Define the set of orthogonal Cholesky decompositions of $\boldsymbol{\Phi}$ as ${\mathcal{W}}_{\sf{CD}} = \{\textbf{U} \in \mathbb{R}^{p\times r} : \boldsymbol{\Phi} = \textbf{U}\textbf{U}^\top,\ \textbf{U}^\top\textbf{U} = \text{diag}(\lambda_1, \ldots, \lambda_r)\}$, where $r = \text{rank}(\boldsymbol{\Phi})$ and $\lambda_1, \lambda_2, \ldots, \lambda_r$ are the eigenvalues of $\boldsymbol{\Phi}$ in descending order. Then, for any two matrices $\textbf{U}, \textbf{U}' \in {\mathcal{W}}_{\sf{CD}}$, there exist …

Figures (11)

  • Figure 1: Features via Recursive Feature Machines. We perform monomial regression on $z\sim\mathcal{N}(0,0.5I_{10})$ with target $f^*(z)=z_0\,z_1\,\mathbf{1}(z_5>0)$. An RFM kernel machine $\hat{f}_{\boldsymbol{\Phi}}(z)=\sum_{y_i\in\mathcal{D}_{\mathrm{train}}}a_iK_{\boldsymbol{\Phi}}(y_i,z)$ is trained for 5 iterations on 4000 samples to produce the ground-truth feature matrix $\boldsymbol{\Phi}^*$ of rank 4 \cite{rfm}. We then query an agent for feedback via: eigendecomposition (Theorem \ref{thm: constructgeneral}), sparse constructive (Theorem \ref{thm: constructsparse}), random Gaussian sampling (Theorem \ref{thm: samplegeneral}), and sparse sampling with $\mu=0.9$ (Theorem \ref{thm: samplingsparse}). Eigendecomposition, sparse constructive, and random sampling achieve the ground-truth MSE with only 55 feedbacks, whereas high-sparsity sampling yields inferior features and larger MSE.
  • Figure 2: Sparse sampling. We consider the same setup as Fig. \ref{fig: monoconst} for the target function $f^*(z) = z_0 z_1 z_3 \mathbf{1}(z_5 > 0)$. In these plots, we employ sparse sampling feedback, where an agent provides feedback based on $\boldsymbol{\Phi}^*$ under different sparsity probabilities ($\mu$: the probability that a 0 is sampled). As $\mu$ decreases, the theorized complexity of $p(p+1)/2 = 55$ feedbacks yields a close approximation of $\boldsymbol{\Phi}^*$. For $\mu = 0.97$, however, the agent needs to sample more activations to approximate $\boldsymbol{\Phi}^*$ properly: as the feedback size grows from 55 and 110 up to 1100, the approximation gradually improves, consistent with Theorem \ref{thm: samplingsparse}.
  • Figure 3: Top: Feature-recovery quality as a function of feedback for a dictionary (of dimension $4096\times512$) from an SAE trained for ChessGPT. Bottom: numeric PCC and feedback for each method. Sparse constructive feedback achieves almost perfect correlation (0.9773) in only $\approx 8.4$M queries; sampling with smaller feedback sizes struggles until $\gtrsim 4$M samples.
  • Figure 4: Feature learning on a subsampled dictionary of dimension $4500 \times 512$ from an SAE trained for Pythia-70M. Theorem \ref{thm: constructgeneral} states that the eigendecomposition method requires 135316 constructive feedbacks. After a few hundred iterations of gradient descent, as shown in Algorithm \ref{alg:gradient}, a PCC of 93% against the ground truth is achieved. For visualization, only the first 100 dimensions are used.
  • Figure 5: Sparse sampling for Pythia-70M. Dimension of feature matrix: $32768 \times 512$; the rank is 215. Plots are shown for varying feedback sizes. Note that $p(p+1)/2 \approx 537$M (about $2^{29}$). We run experiments with 3-sparse activations for uniform sparse distributions. The Pearson Correlation Coefficient (PCC) improves with feedback size; the (feedback size, PCC) pairs are $(200k, .0242), (2M, .38), (5M, .54), (10M, .65)$, and $(20M, .77)$.
  • ...and 6 more figures
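
The feedback counts quoted in the captions can be checked with a few lines of arithmetic. The sketch below assumes that $p$ is the first dimension of each dictionary (10 for the RFM example, 4096 for ChessGPT, 32768 for Pythia-70M) and that the relevant count is the number of free entries of a symmetric $p \times p$ matrix; the function name `symmetric_dof` is illustrative and not taken from the paper.

```python
def symmetric_dof(p: int) -> int:
    """Number of free entries in a symmetric p x p matrix: p(p+1)/2."""
    return p * (p + 1) // 2


# Values of p assumed from the figure captions.
counts = {p: symmetric_dof(p) for p in (10, 4096, 32768)}

for p, n in counts.items():
    print(f"p = {p:>5}: p(p+1)/2 = {n:,}")
```

Under these assumptions the counts are 55 for $p = 10$, about 8.4 million for $p = 4096$, and about $5.4 \times 10^8$ for $p = 32768$, which shows why the largest experiment relies on far fewer sampled feedbacks than the full-rank bound.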

Theorems & Definitions (34)

  • Definition 1: $s$-sparse activations
  • Definition 2: Feature equivalence
  • Definition 3: Oblivious learner
  • Lemma 1: Recovering orthogonal representations
  • Lemma 2
  • proof
  • Proposition 1
  • proof : Proof Outline
  • Lemma 3
  • Theorem 1: General Activations
  • ...and 24 more