SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

Vegard Flovik

TL;DR

SALVE introduces a three-stage pipeline (discover, validate, control) to turn mechanistic interpretability into durable model edits. It trains a sparse autoencoder on internal activations to extract a model-native feature basis, grounds these features with Grad-FAM and activation maximization, and performs permanent weight-space edits guided by the SAE decoder. The framework yields per-sample robustness diagnostics via the critical suppression threshold $\alpha_{\text{crit}}$ and demonstrates controlled, class-targeted interventions on both ResNet-18 and ViT-B/16 across Imagenette and CIFAR-100. Across CNNs and transformers, SALVE shows precise manipulation of concepts with minimal off-target effects and provides a principled path toward transparent, editable AI systems.

Abstract

Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified "discover, validate, and control" framework that bridges mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder's structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_{\text{crit}}$, quantifying each class's reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is validated on both convolutional (ResNet-18) and transformer-based (ViT-B/16) models, demonstrating consistent, interpretable control over their behavior. This work contributes a principled methodology for turning feature discovery into actionable model edits, advancing the development of transparent and controllable AI systems.
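
As a rough illustration of the "discover" stage, an $\ell_1$-regularized autoencoder objective can be sketched as below. This is a minimal sketch under stated assumptions, not the paper's implementation: the ReLU encoder, the single linear decoder, the squared-error reconstruction term, and all names (`sae_forward`, `sae_loss`, `l1_coeff`) are hypothetical choices made for illustration.

```python
import numpy as np

def sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode activations x (batch, d_model) into latents z and reconstruct them.

    Assumption: a ReLU encoder and a linear decoder; the paper may use a
    different architecture.
    """
    z = np.maximum(0.0, x @ W_enc + b_enc)  # non-negative latent features
    x_hat = z @ W_dec + b_dec               # linear reconstruction
    return z, x_hat

def sae_loss(x, x_hat, z, l1_coeff):
    """Reconstruction error plus an l1 penalty on the latent activations."""
    recon = np.mean(np.sum((x - x_hat) ** 2, axis=1))  # per-sample squared error
    sparsity = np.mean(np.sum(np.abs(z), axis=1))      # per-sample l1 norm of z
    return recon + l1_coeff * sparsity

# Toy usage with identity weights (hypothetical values, not from the paper).
x = np.array([[-1.0, 2.0]])
W = np.eye(2)
b = np.zeros(2)
z, x_hat = sae_forward(x, W, b, W, b)
loss = sae_loss(x, x_hat, z, l1_coeff=0.1)
```

In this toy case the negative input coordinate is clipped by the ReLU, so it contributes to the reconstruction error but not to the sparsity term. Raising `l1_coeff` trades reconstruction fidelity for sparser latent codes.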

Paper Structure

This paper contains 50 sections, 19 equations, 16 figures, 1 table.

Figures (16)

  • Figure 1: Validation of discovered features for the ResNet-18 model. (a) Average latent feature activations across classes, showing a sparse, class-specific basis. (b) Activation maximization of the "golf ball" feature. The image evolves from random noise ($t_1$), through emergence of circular shapes ($t_2$), to the final image clearly exhibiting golf ball characteristics ($t_3$). (c) Grad-FAM visualizations grounding the top-4 dominant features for a sample "golf ball" image.
  • Figure 2: (Left) A qualitative case study where suppressing the "Church" feature or enhancing the "Golf ball" feature successfully flips the model's prediction for an ambiguous image. (Right) Quantitative validation on the test set, showing minimal off-target effects.
  • Figure 3: (a) Validation of the "Tower Feature", showing example top-activating images from the test set. (b) Effect of suppressing and enhancing this feature on model predictions, where "red" and "green" correspond to decreasing and increasing class accuracy, respectively.
  • Figure 4: Suppression sensitivity for "Church" (Class 4). (a) Per-class accuracy vs. intervention strength $\alpha$. (b) Distribution of predictions for images of the target class, showing how confidence is reallocated as the feature is suppressed. Shaded regions indicate the standard deviation across 10 SAE initializations.
  • Figure 5: (a) Per-class accuracy vs. the intervention strength $\alpha$. (b) Comparison of the critical suppression threshold estimates. The distributions of the per-sample analytical (filled box) and numerical (hatched box) $\alpha_{\text{crit}}$ are shown as boxplots. The central line indicates the median, the box spans the interquartile range (25th to 75th percentile), and the whiskers extend to the 5th and 95th percentiles. The empirical threshold ($\alpha_{50\%}$) is overlaid as a square marker.
  • ...and 11 more figures