Evaluating Adversarial Robustness of Concept Representations in Sparse Autoencoders

Aaron J. Li, Suraj Srinivas, Usha Bhalla, Himabindu Lakkaraju

TL;DR

This work examines the robustness of sparse autoencoder (SAE) based concept representations used to interpret large language models (LLMs). It formalizes robustness as an input-space optimization problem, combining a ground-truth concept map with a bi-Lipschitz assumption to relate input edits to concept-label changes, and adapts a generalized Greedy Coordinate Gradient (GCG) attack to search for adversarial inputs. The authors develop an evaluation framework with semantic and activation goals across population- and individual-level perturbations, and demonstrate widespread vulnerability of SAE interpretations across multiple LLM-SAE pairs and datasets, including transferability analyses and case studies. The findings underscore the need for robustness-aware design and denoising of SAE-based explanations for reliable model monitoring and oversight, and the work offers a general methodology for evaluating concept-extraction tools in LLMs.

Abstract

Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the base LLM's activations. Overall, our results suggest that SAE concept representations are fragile and that, without further denoising or postprocessing, they might be ill-suited for applications in model monitoring and oversight.
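
To make the idea of an input-space search more concrete, the following is a minimal sketch of one greedy, gradient-guided token substitution step in the spirit of GCG. It is not the authors' implementation. Everything in it is an assumption made for illustration: the "LLM" is replaced by mean-pooling of toy token embeddings, the SAE is reduced to a single encoder row `w_t` with bias `b_t`, and the names `target_preactivation` and `gcg_step` are hypothetical. Real GCG also samples candidate (position, token) pairs at random from per-position gradients, whereas this toy version tries every position exhaustively.

```python
from typing import List, Sequence, Tuple


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def target_preactivation(tokens: Sequence[int],
                         emb: Sequence[Sequence[float]],
                         w_t: Sequence[float],
                         b_t: float) -> float:
    """Toy stand-in for LLM + SAE (assumption): mean-pool the token
    embeddings, then apply one SAE encoder row before the ReLU."""
    d = len(w_t)
    h = [sum(emb[t][j] for t in tokens) / len(tokens) for j in range(d)]
    return dot(w_t, h) + b_t


def gcg_step(tokens: Sequence[int],
             emb: Sequence[Sequence[float]],
             w_t: Sequence[float],
             b_t: float,
             top_k: int = 3) -> Tuple[List[int], float]:
    """One simplified greedy coordinate-gradient step that tries to raise
    the pre-activation of a single target SAE latent."""
    n = len(tokens)
    # Gradient of the objective w.r.t. the one-hot entry of vocab item v.
    # In this linear toy model it is identical at every position.
    grad = [dot(w_t, e) / n for e in emb]
    # Keep the top_k vocabulary items with the largest gradient.
    candidates = sorted(range(len(emb)), key=lambda v: -grad[v])[:top_k]

    best_tokens = list(tokens)
    best_val = target_preactivation(tokens, emb, w_t, b_t)
    # Evaluate every single-token swap with the true objective, keep the best.
    for pos in range(n):
        for v in candidates:
            if v == tokens[pos]:
                continue
            trial = list(tokens)
            trial[pos] = v
            val = target_preactivation(trial, emb, w_t, b_t)
            if val > best_val:
                best_tokens, best_val = trial, val
    return best_tokens, best_val


# Toy data (made up for illustration): 4-token vocabulary, 2-dim embeddings.
emb = [[1, 0], [0, 1], [-1, 0], [2, 1]]
w_t, b_t = [1, 0], 0.0
tokens = [2, 1, 0]

new_tokens, new_val = gcg_step(tokens, emb, w_t, b_t, top_k=2)
print(new_tokens, new_val)
```

In this toy setting a single replacement is enough to raise the target latent's pre-activation, which loosely mirrors the small token budget the abstract describes. The gradient only shortlists candidates; the final choice is made by re-evaluating the objective on each candidate input.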

Paper Structure

This paper contains 42 sections, 10 equations, 17 figures, 7 tables.

Figures (17)

  • Figure 1: An example of a successful targeted population-level attack that doubles the concept overlap between $x_1$ and $x_2$ with only one adversarial token replacement.
  • Figure 2: Attack Results of Gemma2-9B (131k) on AdvBench. Standard deviations are computed across $5$ experiment runs.
  • Figure 3: An example of changes in top 5 activated SAE concepts between the original $x_1$ and the perturbed $x'_1$: only 2 out of the 5 original concepts remain on top. Top concepts in $x_1$ have cold colors, while new concepts introduced by $x'_1$ have warm colors. The natural language annotations of the SAE latents are provided by Neuronpedia.
  • Figure 4: Attack Results: Gemma2-2B on AdvBench
  • Figure 5: Attack Results: Gemma2-2B on AG News
  • ...and 12 more figures
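
The overlap counts mentioned in the captions of Figures 1 and 3 can be illustrated with a short sketch. The definition used here (the fraction of shared indices among the top-$k$ activated SAE latents) is an assumption chosen for illustration and may differ from the paper's exact metric. The function names and the activation values are hypothetical toy data, picked so that 2 of the 5 original top concepts survive, as in the Figure 3 example.

```python
from typing import Sequence, Set


def top_k_concepts(z: Sequence[float], k: int) -> Set[int]:
    """Indices of the k largest SAE activations (ties go to the lower index)."""
    order = sorted(range(len(z)), key=lambda i: (-z[i], i))
    return set(order[:k])


def concept_overlap(z_a: Sequence[float], z_b: Sequence[float], k: int) -> float:
    """Fraction of top-k concept indices shared by two activation vectors.
    Assumed definition, used only for illustration."""
    return len(top_k_concepts(z_a, k) & top_k_concepts(z_b, k)) / k


# Toy SAE activations (made up): original input x_1 and perturbed input x'_1.
z_orig = [0.9, 0.8, 0.7, 0.6, 0.5, 0.0, 0.0, 0.0]
z_pert = [0.9, 0.1, 0.0, 0.6, 0.0, 0.8, 0.7, 0.5]

surviving = top_k_concepts(z_orig, 5) & top_k_concepts(z_pert, 5)
print(sorted(surviving), concept_overlap(z_orig, z_pert, 5))
```

An untargeted attack would try to push this overlap down for a single input, while a targeted population-level attack such as the one in Figure 1 would try to raise the overlap between two different inputs.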