SALVE: Sparse Autoencoder-Latent Vector Editing for Mechanistic Control of Neural Networks

Vegard Flovik

TL;DR

SALVE introduces a three-stage pipeline (discover, validate, control) to turn mechanistic interpretability into durable model edits. It trains a sparse autoencoder on internal activations to extract a model-native feature basis, grounds these features with Grad-FAM and activation maximization, and performs permanent weight-space edits guided by the SAE decoder. The framework yields per-sample robustness diagnostics via the critical suppression threshold $\alpha_{\mathrm{crit}}$ and demonstrates controlled, class-targeted interventions on both ResNet-18 and ViT-B/16 across Imagenette and CIFAR-100. Across CNNs and transformers, SALVE shows precise manipulation of concepts with minimal off-target effects and provides a principled path toward transparent, editable AI systems.

Abstract

Deep neural networks achieve impressive performance but remain difficult to interpret and control. We present SALVE (Sparse Autoencoder-Latent Vector Editing), a unified "discover, validate, and control" framework that bridges mechanistic interpretability and model editing. Using an $\ell_1$-regularized autoencoder, we learn a sparse, model-native feature basis without supervision. We validate these features with Grad-FAM, a feature-level saliency mapping method that visually grounds latent features in input data. Leveraging the autoencoder's structure, we perform precise and permanent weight-space interventions, enabling continuous modulation of both class-defining and cross-class features. We further derive a critical suppression threshold, $\alpha_{\mathrm{crit}}$, quantifying each class's reliance on its dominant feature, supporting fine-grained robustness diagnostics. Our approach is validated on both convolutional (ResNet-18) and transformer-based (ViT-B/16) models, demonstrating consistent, interpretable control over their behavior. This work contributes a principled methodology for turning feature discovery into actionable model edits, advancing the development of transparent and controllable AI systems.
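
As an illustration of the "discover" stage only, the following is a minimal, dependency-free sketch of a generic $\ell_1$-regularized sparse autoencoder objective: reconstruction error plus an $\ell_1$ penalty on the latent code. It is not the SALVE implementation. The ReLU encoder, the linear decoder, the function names, and the `l1_weight` value are all assumptions made for illustration, and the abstract does not specify the architecture or hyperparameters actually used.

```python
# Generic l1-regularized sparse autoencoder loss for one activation vector.
# Every name and default below is a hypothetical choice for illustration,
# not taken from the SALVE paper.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def encode(x, w_enc, b_enc):
    """Assumed ReLU encoder: z = max(0, W_enc x + b_enc)."""
    pre_activation = matvec(w_enc, x)
    return [max(0.0, p + b) for p, b in zip(pre_activation, b_enc)]


def decode(z, w_dec, b_dec):
    """Assumed linear decoder: x_hat = W_dec z + b_dec."""
    return [r + b for r, b in zip(matvec(w_dec, z), b_dec)]


def sae_loss(x, w_enc, b_enc, w_dec, b_dec, l1_weight=0.1):
    """Return (loss, z): mean squared reconstruction error + l1_weight * ||z||_1."""
    z = encode(x, w_enc, b_enc)
    x_hat = decode(z, w_dec, b_dec)
    reconstruction = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    sparsity = sum(abs(v) for v in z)
    return reconstruction + l1_weight * sparsity, z


if __name__ == "__main__":
    identity = [[1.0, 0.0], [0.0, 1.0]]
    zero_bias = [0.0, 0.0]
    loss, code = sae_loss([1.0, -1.0], identity, zero_bias, identity, zero_bias)
    print(loss, code)
```

In this toy call the ReLU zeroes the negative coordinate, so the latent code has a single active unit. The $\ell_1$ term is what pushes a trained autoencoder toward such sparse codes. In practice the loss would be minimized over many activation vectors with an automatic-differentiation library.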
