Sparse Autoencoder Features for Classifications and Transferability

Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman

TL;DR

This work presents a comprehensive, reproducible evaluation of Sparse Autoencoder (SAE) features as interpretable representations extracted from large language models for safety-critical classification. It systematically analyzes SAE configurations (layer choice, width, pooling, and binarization) across model scales (Gemma 2 2B/9B/9B-IT) and tasks, showing that binarized, summation-based SAE features from middle layers often outperform TF-IDF and hidden-state baselines, achieving macro $F1$ scores exceeding $0.85$ on several benchmarks. The study extends to multilingual and cross-modal settings: native-language training is generally superior, English transfer remains viable, and there is preliminary evidence of cross-modal transfer in vision-language tasks. It also demonstrates that SAE features from smaller models can predict the actions of larger instruction-tuned models, which suggests a scalable approach to model oversight. Together, these results establish practical best practices for SAE-based interpretability and point toward scalable, transparent deployment of LLMs in real-world applications.

Abstract

Sparse Autoencoders (SAEs) offer a promising way to uncover structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and bag-of-words (BoW) baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.
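
To make the pooling and binarization steps concrete, here is a minimal sketch assuming token-level SAE activations arrive as a non-negative array of shape (tokens, features). The function name `pool_sae_features`, the set of pooling options, and the default threshold of 0 are illustrative assumptions, not taken from the MOSAIC repository.

```python
import numpy as np


def pool_sae_features(token_acts, pooling="sum", binarize=True, threshold=0.0):
    """Aggregate token-level SAE activations into one vector per input.

    token_acts: array of shape (num_tokens, num_features).
    pooling:    "sum", "max", or "mean" over the token axis (hypothetical options).
    binarize:   if True, map each pooled value to 1 when it exceeds `threshold`.
    """
    acts = np.asarray(token_acts, dtype=float)
    if pooling == "sum":
        pooled = acts.sum(axis=0)
    elif pooling == "max":
        pooled = acts.max(axis=0)
    elif pooling == "mean":
        pooled = acts.mean(axis=0)
    else:
        raise ValueError(f"unknown pooling strategy: {pooling}")

    if binarize:
        # Assumed rule: a feature counts as "on" if its pooled value is above the threshold.
        return (pooled > threshold).astype(np.int8)
    return pooled


# Toy example: 3 tokens, 4 SAE features (made-up values).
toy_acts = np.array([
    [0.0, 1.5, 0.0, 0.2],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3],
])
summed = pool_sae_features(toy_acts, pooling="sum", binarize=False)
binary = pool_sae_features(toy_acts, pooling="sum", binarize=True)
```

The resulting fixed-length vectors could then be fed to any standard classifier. In this sketch, binarization keeps only which features fired and discards their magnitudes, which is one way to read the abstract's point that it can stand in for explicit feature selection.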

Paper Structure

This paper contains 55 sections, 2 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Multilingual performance comparison across three feature selection methods under varying training data sampling rates. Solid bars represent models trained on native language data, while hatched bars show performance with English transfer learning. Binarized SAE features demonstrate robustness across different training data constraints.
  • Figure 2: Diagram explaining our approaches to evaluating token-level pooling and aggregation of SAE features.
  • Figure 3: Analysis of model performance across different layers and pooling strategies. A strong baseline is established by taking the best hidden-state performance per task and averaging it across the three models.
  • Figure 4: Multilingual toxicity detection results (middle-layer features): Native SAE Training (pink) consistently achieves the best F1 scores. Transferring from English (gold) or using translated inputs (green) leads to moderate performance declines. 9B-IT models show a similar trend, with slightly improved cross-lingual generalization in some language pairs.
  • Figure 5: Comparison of average F1 scores by different feature selection methods on the Multilingual Classification and Transfer task. The boxes represent the mean $\pm$ standard deviation, and the whiskers indicate the interquartile range (IQR).
  • ...and 10 more figures