Provably Extracting the Features from a General Superposition

Allen Liu

TL;DR

The paper studies learning a sum of ridge-function features in the overcomplete setting, where the number of features exceeds the ambient dimension. It introduces a Fourier-sparsity perspective and a Gaussian-smoothed framework that locate the hidden feature directions via high-dimensional Fourier mass, using a frequency-finding procedure built on mass-estimation oracles. The central contributions are (i) identifiability and efficient recovery of nonlinear feature directions under mild nondegeneracy and Lipschitz assumptions, and (ii) a robust function-recovery procedure that reconstructs the target function to a specified accuracy on a bounded domain, with a route to removing the boundedness assumption. Together, these results generalize prior work by handling arbitrary activations and correlated directions, and they yield a query-efficient algorithm for feature extraction under superposition. The approach has potential implications for model extraction and interpretability in settings where a trained model behaves as a sum of nonlinear ridge components planted in a high-dimensional space.

Abstract

It is widely believed that complex machine learning models generally encode features through linear representations, but these features exist in superposition, making them challenging to recover. We study the following fundamental setting for learning features in superposition from black-box query access: we are given query access to a function \[ f(x)=\sum_{i=1}^n a_i\,\sigma_i(v_i^\top x), \] where each unit vector $v_i$ encodes a feature direction and $\sigma_i:\mathbb{R} \rightarrow \mathbb{R}$ is an arbitrary response function, and our goal is to recover the $v_i$ and the function $f$. In learning-theoretic terms, superposition refers to the overcomplete regime, in which the number of features is larger than the underlying dimension (i.e., $n > d$), which has proven especially challenging for typical algorithmic approaches. Our main result is an efficient query algorithm that, from noisy oracle access to $f$, identifies all feature directions whose responses are non-degenerate and reconstructs the function $f$. Crucially, our algorithm works in a significantly more general setting than all related prior results -- we allow for essentially arbitrary superpositions, only requiring that $v_i, v_j$ are not nearly identical for $i \neq j$, and general response functions $\sigma_i$. At a high level, our algorithm introduces an approach for searching in Fourier space by iteratively refining the search space to locate the hidden directions $v_i$.
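To make the setting concrete, the following is a minimal sketch of the query model only, not of the paper's recovery algorithm. It builds an overcomplete instance ($n > d$) of $f(x)=\sum_i a_i\,\sigma_i(v_i^\top x)$ and wraps it in a noisy oracle. Everything specific here is an assumption chosen for illustration: the helper names (`make_superposition`, `noisy_oracle`, `max_abs_inner_product`), the three response functions, the coefficient range, and the bounded uniform noise model.

```python
import math
import random


def random_unit_vector(d, rng):
    """Draw a direction uniformly from the unit sphere in R^d."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def make_superposition(d, n, rng):
    """Return (f, directions, coefficients) with f(x) = sum_i a_i * sigma_i(<v_i, x>).

    The response functions below are illustrative choices, not taken from the paper.
    """
    activations = [math.tanh, math.sin, lambda t: max(t, 0.0)]
    directions = [random_unit_vector(d, rng) for _ in range(n)]
    coefficients = [rng.uniform(0.5, 1.5) for _ in range(n)]
    sigmas = [activations[i % len(activations)] for i in range(n)]

    def f(x):
        return sum(a * s(dot(v, x)) for a, s, v in zip(coefficients, sigmas, directions))

    return f, directions, coefficients


def noisy_oracle(f, noise_level, rng):
    """Wrap f in a query oracle with bounded additive noise (assumed noise model)."""
    def query(x):
        return f(x) + rng.uniform(-noise_level, noise_level)
    return query


def max_abs_inner_product(directions):
    """Largest |<v_i, v_j>| over i != j; a value of 1 would mean two directions coincide up to sign."""
    return max(
        abs(dot(directions[i], directions[j]))
        for i in range(len(directions))
        for j in range(i + 1, len(directions))
    )


rng = random.Random(0)
d, n = 4, 10  # overcomplete: more features than dimensions
f, directions, coefficients = make_superposition(d, n, rng)
query = noisy_oracle(f, 1e-3, rng)

x = [0.1] * d
print("noisy query:", query(x))
print("max |<v_i, v_j>|:", max_abs_inner_product(directions))
```

A learner in this setting sees only `query`; the directions and coefficients returned alongside it are kept solely so that a recovery procedure could be checked against ground truth. The separation check mirrors the requirement that $v_i, v_j$ not be nearly identical for $i \neq j$.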
