Automated Feature Labeling with Token-Space Gradient Descent

Julian Schulz, Seamus Fallows

TL;DR

This work introduces a token-space gradient-descent framework for automated feature labeling, replacing hypothesis-generation by language models with a differentiable discriminator that predicts feature activations. A label prototype is encoded as a vector $\boldsymbol{v}$ over the vocabulary, yielding $\boldsymbol{p} = \text{softmax}(\boldsymbol{v})$ and $\boldsymbol{e} = E\boldsymbol{p}$ to permit token superposition, and the objective combines $L_{\text{acc}}$, entropy regularization, and KL-divergence to promote linguistic naturalness. The method is evaluated on synthetic features across animals, mammals, Chinese text, and numbers, showing convergence to interpretable single-token labels in several cases, but revealing limitations with data balance, multi-token labels, and evaluator capabilities. Overall, token-space gradient descent offers a promising, model-agnostic complement to hypothesis-driven labeling that could strengthen interpretability workflows and AI-safety analyses by providing a differentiable, alternative labeling mechanism.
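To make the parameterization concrete, the following is a minimal NumPy sketch of the soft label and the two regularizers named above. It is an illustration under stated assumptions, not the authors' implementation: the function names, the weights `lambda_ent` and `lambda_kl`, the `prior` reference distribution, and the direction of the KL term are all hypothetical choices, and $L_{\text{acc}}$ is passed in as a plain number because computing it requires the discriminator language model.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax over a 1-D logit vector."""
    z = v - v.max()
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def soft_label_embedding(v, E):
    """Return (p, e) with p = softmax(v) and e = E @ p.

    Assumption: E has shape (d_model, vocab_size), so e is a
    probability-weighted superposition of token embeddings.
    """
    p = softmax(v)
    return p, E @ p

def entropy(p, eps=1e-12):
    """Shannon entropy of p; low entropy means mass sits on few tokens."""
    return float(-(p * np.log(p + eps)).sum())

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q). The direction is an assumption made for this sketch."""
    return float((p * (np.log(p + eps) - np.log(q + eps))).sum())

def total_loss(l_acc, p, prior, lambda_ent=0.1, lambda_kl=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return l_acc + lambda_ent * entropy(p) + lambda_kl * kl_divergence(p, prior)

# Toy usage with a 5-token vocabulary and 3-dimensional embeddings.
rng = np.random.default_rng(0)
E = rng.normal(size=(3, 5))
v = np.zeros(5)                     # uniform start: no token preferred
p, e = soft_label_embedding(v, E)
prior = np.full(5, 0.2)             # stand-in for a language-model prior
loss = total_loss(l_acc=1.0, p=p, prior=prior)
```

In a real run, `v` would be updated by gradient descent through the discriminator, so an autodiff framework would replace NumPy. As `v` sharpens toward a single token, the entropy term shrinks and `e` approaches that token's embedding, which is how a readable single-token label can emerge.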

Abstract

We present a novel approach to feature labeling using gradient descent in token-space. While existing methods typically use language models to generate hypotheses about feature meanings, our method directly optimizes label representations by using a language model as a discriminator to predict feature activations. We formulate this as a multi-objective optimization problem in token-space, balancing prediction accuracy, entropy minimization, and linguistic naturalness. Our proof-of-concept experiments demonstrate successful convergence to interpretable single-token labels across diverse domains, including features for detecting animals, mammals, Chinese text, and numbers. Although our current implementation is constrained to single-token labels and relatively simple features, the results suggest that token-space gradient descent could become a valuable addition to the interpretability researcher's toolkit.

Paper Structure

This paper contains 23 sections, 6 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: A feature label can be evaluated by how well it enables prediction of feature activations. Here, a human uses the label "Animal" to predict which tokens in the input text will have high feature activation values.
  • Figure 2: Our method replaces the human evaluator with a language model, making the process differentiable. The bottom plot shows the optimization trajectory in token-space, where the probability mass concentrates on the token "animal" after several hundred steps.
  • Figure 3: Optimization trajectories showing token probability distributions over training steps for successfully labeled features. Each panel represents a different synthetic feature trained on distinct text corpora. The winning token is shown in red, while the next 9 highest-probability tokens during optimization are shown in blue shades. Alternative valid descriptive tokens are traced with red dotted lines. (a) Animal detection feature: converges to "animal" after 500 steps, with semantically related tokens (e.g., "Anim", "rabbit") showing temporary prominence. (b) Mammal-specific feature: converges to "mamm" when trained on a balanced dataset of mammal and non-mammal animals. (c) Chinese text detection: shows sudden convergence to "中文" (meaning "Chinese") at step 280. (d) Number detection: quick convergence to "number" for a feature active on both numerical digits and number words, with related tokens like "num" and "nummer" appearing during optimization. Training data and hyperparameters are detailed in the data appendix.
  • Figure 4: Optimization trajectories for unsuccessful feature labeling attempts. Format matches Figure 3, with winning tokens in red and top competing tokens in blue. (a) Failed mammal detection: when trained on natural text without balanced sampling, the optimization converges to the broader category "animal" instead of the intended mammal-specific label. (b) Failed palindrome detection: optimization defaults to the simple token "a", due to the base model's inability to reliably classify palindromes even with correct labeling. Complete training data and hyperparameters are provided in the data appendix.
  • Figure 5: Animal Text Dataset: Natural language sentences containing animals. The feature is active (bold) on words referring to animals.
  • ...and 5 more figures