SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals

Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong

TL;DR

The paper identifies a SuperActivator Mechanism in transformer representations: most concept activations are noisy and overlapping, but the extreme high tail of the in-concept activation distribution reliably signals concept presence. Using thresholds derived from validation data and max-pooling aggregation over the tail, SuperActivators achieve up to a 14% higher concept-detection F1 across image and text modalities, architectures, and concept-extraction methods. Beyond detection, using SuperActivators for attribution yields more accurate and faithful concept localizations than global concept vectors. The findings suggest that sparse, tail-focused signals are a robust and generalizable source of semantic information in transformers, with practical implications for interpretable AI across multimodal tasks.

Abstract

Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches, achieving up to a 14% higher F1 score across image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage SuperActivator tokens to improve feature attributions for concepts.
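The detection idea summarized above can be illustrated with a minimal sketch. It assumes that the threshold is the $(1-\delta)$ quantile of in-concept validation activations and that a sample is flagged when its maximum token activation reaches that threshold. The function names `calibrate_threshold` and `detect_concept`, the default $\delta = 0.10$, and the toy numbers are hypothetical and not taken from the paper.

```python
import numpy as np

def calibrate_threshold(in_concept_acts, delta=0.10):
    """Activation value above which the top `delta` fraction of
    in-concept validation activations lies (assumed calibration rule)."""
    acts = np.asarray(in_concept_acts, dtype=float)
    return float(np.quantile(acts, 1.0 - delta))

def detect_concept(token_acts, threshold):
    """Max-pool the token activations of one sample and compare the
    result with the calibrated threshold."""
    return bool(np.max(np.asarray(token_acts, dtype=float)) >= threshold)

# Toy validation activations for tokens known to lie inside the concept.
val_in = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0])
tau = calibrate_threshold(val_in, delta=0.10)

# One sample with a single very strong token, one without.
has_concept = detect_concept([0.2, 0.95, 0.1], tau)
no_concept = detect_concept([0.2, 0.5, 0.3], tau)
```

Under these assumptions a single token in the tail is enough to flag a sample, which is why the decision depends on the maximum rather than the mean activation.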

Paper Structure

This paper contains 61 sections, 15 equations, 39 figures, 14 tables.

Figures (39)

  • Figure 1: The SuperActivator Mechanism concentrates the most informative concept signals into a sparse set of in-concept activations. These signals reliably distinguish true concept occurrences even when concept activation heatmaps spuriously highlight absent concepts or fail to fully capture present ones. This example shows LLaMA-3.2-11B-Vision-Instruct linear separator concept activations on a COCO image; examples for all image and text datasets are provided in Appendix \ref{app:superactivator-examples}.
  • Figure 2: Transformers express concept activations inconsistently, making it difficult to distinguish in-concept tokens from out-of-concept tokens. In this test-set example from the Augmented GoEmotions dataset, the ground-truth span for Joy is highlighted, with token-level activations for LLaMA-3.2-11B-Vision-Instruct linear separator concepts shown both as a heatmap over the text (left) and as distributions (right). While a few in-concept tokens exhibit extremely high activations, many remain indistinguishable from out-of-concept token activations, both within the sample and across $D_c^{\text{out}}$.
  • Figure 3: $D_c^{\text{in}}$ and $D_c^{\text{out}}$ become more distinct with depth, though the separation is concentrated in a small subset of tokens in the tail of $D_c^{\text{in}}$. Shown here are activation distributions for three linear separator concepts from LLaMA-3.2-11B-Vision-Instruct on the OpenSurfaces dataset (left), as well as the proportion of $D_c^{\text{in}}$ activations exceeding $q_{0.98}(D_c^{\text{out}})$ across layers (right).
  • Figure 4: Most true-concept images in the OpenSurfaces dataset have at least one LLaMA-3.2-11B-Vision-Instruct linear separator activation in the high-activation tail of $D_c^{\text{in}}$, well separated from $q_{0.98}(D_c^{\text{out}})$.
  • Figure 5: SuperActivator-based concept detection is most effective when using only a small fraction of the most highly activated tokens ($5$--$10\%$). This figure presents the number of LLaMA-3.2-11B-Vision-Instruct linear separator concept vectors that achieve their strongest $F_1$ scores at each sparsity level $\delta$. Comprehensive results are provided in Appendix \ref{app:optimal-sparsity-across-layers}.
  • ...and 34 more figures
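
The quantity plotted in the right panel of Figure 3, the proportion of $D_c^{\text{in}}$ activations exceeding $q_{0.98}(D_c^{\text{out}})$, can be computed as in the sketch below. The function name `tail_fraction` and the toy arrays are hypothetical, and the strict inequality is an assumption about how "exceeding" is evaluated.

```python
import numpy as np

def tail_fraction(d_in, d_out, q=0.98):
    """Fraction of in-concept activations that lie strictly above the
    q-th quantile of the out-of-concept activations."""
    cutoff = np.quantile(np.asarray(d_out, dtype=float), q)
    return float(np.mean(np.asarray(d_in, dtype=float) > cutoff))

# Toy out-of-concept activations: 0.00, 0.01, ..., 1.00.
d_out = np.arange(101) / 100.0
# Toy in-concept activations: two overlap with d_out, two sit in the tail.
d_in = np.array([0.5, 0.9, 1.2, 1.5])

frac = tail_fraction(d_in, d_out)
```

Evaluating this fraction separately for each layer would give one point per layer, which is the kind of curve the caption describes.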