SuperActivators: Only the Tail of the Distribution Contains Reliable Concept Signals
Cassandra Goldberg, Chaehyeon Kim, Adam Stein, Eric Wong
TL;DR
The paper identifies a SuperActivator Mechanism in transformer representations: although most concept activations are noisy and in-concept and out-of-concept activations overlap, the extreme high tail of the in-concept activation distribution reliably signals concept presence. Using thresholds derived from validation data and max-pooling aggregation over the tail, SuperActivators achieve up to a 14% higher concept-detection F1 score across image and text modalities, architectures, and concept-extraction methods. Beyond detection, using SuperActivators for attribution yields more accurate and faithful concept localizations than global concept vectors, with consistent improvements across benchmarks. The findings suggest that sparse, tail-focused signals are a robust, generalizable source of semantic information in transformers, which has practical implications for interpretable AI in multimodal tasks.
Abstract
Concept vectors aim to enhance model interpretability by linking internal representations with human-understandable semantics, but their utility is often limited by noisy and inconsistent activations. In this work, we uncover a clear pattern within the noise, which we term the SuperActivator Mechanism: while in-concept and out-of-concept activations overlap considerably, the token activations in the extreme high tail of the in-concept distribution provide a reliable signal of concept presence. We demonstrate the generality of this mechanism by showing that SuperActivator tokens consistently outperform standard vector-based and prompting concept detection approaches, achieving up to a 14% higher F1 score across image and text modalities, model architectures, model layers, and concept extraction techniques. Finally, we leverage SuperActivator tokens to improve feature attributions for concepts.
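The detection idea can be illustrated with a small sketch: score every token against a concept vector, set a threshold in the high tail of in-concept validation activations, and flag a sample when its strongest token reaches that threshold. This is an illustration under stated assumptions, not the authors' implementation. The function names, the use of cosine similarity, the quantile-based threshold, and the `tail_fraction` parameter are all hypothetical choices made for this example.

```python
import numpy as np


def token_activations(tokens, concept_vector):
    """Cosine similarity of each token embedding with a concept vector.

    Assumption: activations are cosine similarities; the source does not
    specify the scoring function.
    """
    tokens = np.asarray(tokens, dtype=float)
    vector = np.asarray(concept_vector, dtype=float)
    token_norms = np.linalg.norm(tokens, axis=-1)
    return tokens @ vector / (token_norms * np.linalg.norm(vector) + 1e-12)


def tail_threshold(in_concept_acts, tail_fraction=0.1):
    """Value above which the top `tail_fraction` of in-concept
    validation activations lie (hypothetical threshold rule)."""
    return float(np.quantile(in_concept_acts, 1.0 - tail_fraction))


def detect_concept(sample_acts, threshold):
    """Max-pool over tokens: the concept is flagged as present if the
    strongest token activation reaches the tail threshold."""
    return bool(np.max(sample_acts) >= threshold)


# Usage with toy validation activations (made-up numbers).
validation_acts = np.linspace(0.1, 1.0, 10)
threshold = tail_threshold(validation_acts, tail_fraction=0.2)
is_present = detect_concept(np.array([0.1, 0.9]), threshold)
```

Because only the maximum token activation matters, a single strongly activating token is enough to flag the concept, while many weakly activating tokens are ignored. That is the sense in which the signal is taken from the tail rather than from the bulk of the distribution.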
