Sparse Autoencoder Features for Classifications and Transferability
Jack Gallifant, Shan Chen, Kuleen Sasse, Hugo Aerts, Thomas Hartvigsen, Danielle S. Bitterman
TL;DR
This work presents a comprehensive, reproducible evaluation of Sparse Autoencoder (SAE) features as interpretable representations extracted from large language models for safety-critical classification. It systematically analyzes SAE configurations (layer choice, width, pooling, and binarization) across model scales (Gemma 2 2B/9B/9B-IT) and tasks, showing that binarized, summation-based SAE features from middle layers often outperform TF-IDF and hidden-state baselines, achieving macro $F1$ scores exceeding $0.85$ on several benchmarks. The study extends to multilingual and cross-modal settings, finding that native-language training is generally superior but that transfer from English remains viable, and it provides preliminary evidence of cross-modal transfer in vision-language tasks. Additionally, it demonstrates that SAE features from a smaller model can predict the actions of larger instruction-tuned models, highlighting a scalable approach to model oversight. Together, these results establish practical best practices for SAE-based interpretability and point toward scalable, transparent deployment of LLMs in real-world applications.
Abstract
Sparse Autoencoders (SAEs) offer potential for uncovering structured, human-interpretable representations in Large Language Models (LLMs), making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAEs for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and bag-of-words (BoW) baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.
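As a rough illustration of the pooling and binarization steps named above, the following minimal sketch sums per-token SAE activations into one vector per input and then thresholds it into binary features. The function names, the toy activation values, the threshold, and the order of operations (pool first, then binarize) are assumptions made for illustration; the MOSAIC repository may implement these steps differently.

```python
from typing import List, Sequence


def sum_pool(token_activations: Sequence[Sequence[float]]) -> List[float]:
    """Sum SAE activations over the token axis.

    token_activations has one row per token and one column per SAE feature
    (hypothetical layout, chosen for this sketch).
    """
    return [sum(column) for column in zip(*token_activations)]


def binarize(pooled: Sequence[float], threshold: float = 0.0) -> List[int]:
    """Map each pooled activation to 1 if it exceeds the threshold, else 0."""
    return [1 if value > threshold else 0 for value in pooled]


# Toy example: 3 tokens, 4 SAE features (values are made up).
token_acts = [
    [0.0, 1.5, 0.0, 0.2],
    [0.0, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.1],
]

pooled = sum_pool(token_acts)                # one value per SAE feature
features = binarize(pooled, threshold=0.25)  # binary vector for a classifier
print(features)
```

The resulting binary vector could then be passed to any standard classifier, for example logistic regression; the choice of classifier is likewise an assumption of this sketch.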
