Dictionary Learning: The Complexity of Learning Sparse Superposed Features with Feedback
Akash Kumar
TL;DR
The paper studies how to recover a latent feature dictionary, encoded in a matrix $\boldsymbol{\sf D}$ with atoms $u_i$, by querying an agent for relative triplet feedback in order to learn the feature matrix $\boldsymbol{\Phi}=\boldsymbol{\sf D}\boldsymbol{\sf D}^\top$. It develops a theory of feedback complexity in two settings, constructive (the agent builds the activations) and sampled (the activations are drawn from a distribution), and derives tight bounds that depend on the rank $r$ of $\boldsymbol{\Phi}$ and on the activation sparsity, using reductions to pairwise equalities and low-rank decompositions. The key results are a worst-case $\Omega(p^2)$ lower bound in the full-rank case, improved bounds for low-rank matrices, and upper bounds under 2-sparse constructive feedback. In the sampling regime, the feedback complexity is $\Theta(p(p+1)/2)$ almost surely when activations are drawn from a Lebesgue distribution, with bounds that vary with the sparsity level in the sparse case. Experimentally, the framework is validated on Recursive Feature Machines and on sparse autoencoder dictionaries from large language models; the experiments show efficient feature recovery when the appropriate structure is present and use low-rank factorization to address memory and computation costs. Overall, the work clarifies when features can be retrieved efficiently from minimal, semantically meaningful feedback, and it highlights the importance of low-rank structure for scaling neural-feature interpretability in practice.
Abstract
The success of deep networks is crucially attributed to their ability to capture latent features within a representation space. In this work, we investigate whether the underlying learned features of a model can be efficiently retrieved through feedback from an agent, such as a large language model (LLM), in the form of relative triplet comparisons. These features may represent various constructs, including dictionaries in LLMs or a covariance matrix of Mahalanobis distances. We analyze the feedback complexity associated with learning a feature matrix in sparse settings. Our results establish tight bounds when the agent is permitted to construct activations and demonstrate strong upper bounds in sparse scenarios when the agent's feedback is limited to distributional information. We validate our theoretical findings through experiments on two distinct applications: feature recovery from Recursive Feature Machines and dictionary extraction from sparse autoencoders trained on large language models.
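To make the notion of a relative triplet comparison concrete, the following is a minimal sketch, not the paper's procedure. It assumes the feedback on activations $(x, y, z)$ is the sign of the difference between the squared Mahalanobis-type distances $(x-y)^\top\boldsymbol{\Phi}(x-y)$ and $(x-z)^\top\boldsymbol{\Phi}(x-z)$. The function name `triplet_feedback`, the sign convention, the $p \times r$ shape of the dictionary, and the Gaussian sampling are all illustrative assumptions. The sketch also shows why $p(p+1)/2$ is a natural scale for the bounds: it is the number of free entries of a symmetric $p \times p$ matrix.

```python
import numpy as np

def triplet_feedback(phi, x, y, z):
    """Hypothetical oracle: -1 if x is closer to y than to z under phi,
    +1 if x is closer to z, and 0 on a tie."""
    d_xy = (x - y) @ phi @ (x - y)  # squared distance between x and y
    d_xz = (x - z) @ phi @ (x - z)  # squared distance between x and z
    return int(np.sign(d_xy - d_xz))

rng = np.random.default_rng(0)
p, r = 5, 2                       # ambient dimension and rank (illustrative)
D = rng.standard_normal((p, r))   # assumed p x r dictionary factor
phi = D @ D.T                     # feature matrix, symmetric PSD, rank <= r

x, y, z = rng.standard_normal((3, p))  # three sampled activations
label = triplet_feedback(phi, x, y, z)

# A symmetric p x p matrix has p(p+1)/2 free entries.
num_free_entries = p * (p + 1) // 2
```

Under this assumed form the feedback is unchanged when `phi` is multiplied by any positive constant, so comparisons of this kind can determine the feature matrix at most up to positive scaling.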
