Automated Feature Labeling with Token-Space Gradient Descent
Julian Schulz, Seamus Fallows
TL;DR
This work introduces a token-space gradient-descent framework for automated feature labeling. Instead of having a language model generate hypotheses, it uses a language model as a differentiable discriminator that predicts feature activations. A label prototype is encoded as a vector $\boldsymbol{v}$ over the vocabulary, yielding $\boldsymbol{p} = \text{softmax}(\boldsymbol{v})$ and $\boldsymbol{e} = E\boldsymbol{p}$, which permits superposition of tokens. The objective combines an accuracy loss $L_{\text{acc}}$, entropy regularization, and a KL-divergence term that promotes linguistic naturalness. The method is evaluated on synthetic features for animals, mammals, Chinese text, and numbers; it converges to interpretable single-token labels in several cases but reveals limitations related to data balance, multi-token labels, and evaluator capabilities. Overall, token-space gradient descent offers a promising, model-agnostic complement to hypothesis-driven labeling that could strengthen interpretability workflows and AI-safety analyses by providing a differentiable alternative labeling mechanism.
Abstract
We present a novel approach to feature labeling using gradient descent in token-space. While existing methods typically use language models to generate hypotheses about feature meanings, our method directly optimizes label representations by using a language model as a discriminator to predict feature activations. We formulate this as a multi-objective optimization problem in token-space, balancing prediction accuracy, entropy minimization, and linguistic naturalness. Our proof-of-concept experiments demonstrate successful convergence to interpretable single-token labels across diverse domains, including features for detecting animals, mammals, Chinese text, and numbers. Although our current implementation is constrained to single-token labels and relatively simple features, the results suggest that token-space gradient descent could become a valuable addition to the interpretability researcher's toolkit.
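To make the formulation concrete, the following is a minimal sketch of the forward pass only, written in plain Python. It is not the authors' implementation. It assumes that $E$ is a $d \times |V|$ embedding matrix, that the accuracy loss is supplied by some discriminator (here just a number passed in), that the KL term is taken from $\boldsymbol{p}$ to a hypothetical reference distribution `prior`, and that the three terms are combined as a weighted sum. The weights `lambda_ent` and `lambda_kl` and all function names are illustrative assumptions.

```python
import math


def softmax(v):
    """Numerically stable softmax over a list of logits v."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    z = sum(exps)
    return [x / z for x in exps]


def soft_embedding(E, p):
    """Compute e = E p, where E is stored as d rows of length |V|."""
    return [sum(w * pj for w, pj in zip(row, p)) for row in E]


def entropy(p, eps=1e-12):
    """Shannon entropy of p; low entropy means p is close to a single token."""
    return -sum(pj * math.log(pj + eps) for pj in p)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q). The direction is an assumption, not taken from the paper."""
    return sum(pj * math.log((pj + eps) / (qj + eps)) for pj, qj in zip(p, q))


def total_loss(v, prior, acc_loss, lambda_ent=0.1, lambda_kl=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    p = softmax(v)
    return acc_loss + lambda_ent * entropy(p) + lambda_kl * kl_divergence(p, prior)


if __name__ == "__main__":
    # Toy vocabulary of 3 tokens with 2-dimensional embeddings (made-up values).
    E = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
    v = [2.0, 0.0, 0.0]          # label logits, the quantity being optimized
    prior = [1 / 3, 1 / 3, 1 / 3]  # hypothetical reference distribution

    p = softmax(v)
    e = soft_embedding(E, p)     # mixture of token embeddings fed to the model
    print("p =", p)
    print("e =", e)
    print("loss =", total_loss(v, prior, acc_loss=0.5))
```

In an actual optimization loop, `acc_loss` would be recomputed from the discriminator at every step and gradients would flow back to `v` through `e`, which in practice calls for an automatic-differentiation framework rather than plain Python.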
