How Pruning Reshapes Features: Sparse Autoencoder Analysis of Weight-Pruned Language Models

Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

Abstract

Weight pruning is a standard technique for compressing large language models, yet its effect on learned internal representations remains poorly understood. We present the first systematic study of how unstructured pruning reshapes the feature geometry of language models, using Sparse Autoencoders (SAEs) as interpretability probes. Across three model families (Gemma 3 1B, Gemma 2 2B, Llama 3.2 1B), two pruning methods (magnitude and Wanda), and six sparsity levels (0--60%), we investigate five research questions spanning seed stability, feature survival, SAE transferability, feature fragility, and causal relevance. Our most striking finding is that rare SAE features--those with low firing rates--survive pruning far better than frequent ones, with within-condition Spearman correlations of $\rho = -1.0$ in 11 of 17 experimental conditions. This counter-intuitive result suggests that pruning acts as implicit feature selection, preferentially destroying high-frequency generic features while preserving specialized rare ones. We further show that Wanda pruning preserves feature structure up to 3.7$\times$ better than magnitude pruning, that pre-trained SAEs remain viable on Wanda-pruned models up to 50% sparsity, and that geometric feature survival does not predict causal importance--a dissociation with implications for interpretability under compression.
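
The abstract compares two unstructured pruning criteria, magnitude and Wanda. As background, the sketch below shows how the corresponding keep-masks are commonly computed: magnitude pruning removes the smallest-magnitude weights in a layer, while Wanda scores each weight by its magnitude times the L2 norm of its input activation channel on a small calibration set and prunes the lowest-scoring weights within each output row. The function names, the per-layer and per-row granularity, and the calibration details are illustrative assumptions, not this paper's implementation.

```python
import torch

def magnitude_mask(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep-mask for unstructured magnitude pruning: drop the smallest |w| in the layer."""
    k = int(weight.numel() * sparsity)              # number of weights to remove
    if k == 0:
        return torch.ones_like(weight, dtype=torch.bool)
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight.abs() > threshold

def wanda_mask(weight: torch.Tensor, act_norm: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Keep-mask for Wanda-style pruning: score w_ij by |w_ij| * ||x_j||_2 and drop
    the lowest-scoring fraction within each output row.
    act_norm[j] is the L2 norm of input channel j over a calibration set."""
    scores = weight.abs() * act_norm.unsqueeze(0)             # (out_dim, in_dim)
    k = int(weight.shape[1] * sparsity)                       # weights to remove per row
    mask = torch.ones_like(weight, dtype=torch.bool)
    if k == 0:
        return mask
    prune_idx = scores.topk(k, dim=1, largest=False).indices  # k lowest scores per row
    mask.scatter_(1, prune_idx, False)
    return mask

# Applying a mask zeroes the pruned weights in place, e.g.:
# layer.weight.data *= magnitude_mask(layer.weight.data, sparsity=0.5)
```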

Paper Structure

This paper contains 31 sections, 1 equation, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: Seed stability (MNN rate at $\tau = 0.7$) as a function of sparsity. Error bars show standard deviation across seed pairs. While absolute MNN rates are low (2--4%), the pattern of degradation with sparsity is consistent. (The MNN and FVU computations are sketched after this list.)
  • Figure 2: Dense$\to$pruned feature survival (MNN at $\tau = 0.7$). Wanda consistently preserves more features than magnitude pruning across all models and sparsity levels.
  • Figure 3: Transferability of pre-trained (official) SAEs to pruned model activations. FVU (fraction of variance unexplained) is plotted against sparsity for Gemma 3 1B and Gemma 2 2B, separated by pruning method. Wanda-pruned activations remain well within the reconstructable range.
  • Figure 4: Feature survival rate by firing rate quintile ($\tau = 0.7$). Across all models and pruning methods, rare features (Q1) survive substantially better than frequent ones (Q5). Lines represent different pruning method--sparsity combinations.
  • Figure 5: Feature fragility by firing rate, separated by pruning method. Blue: magnitude pruning; orange: Wanda. The monotonic decline from Q1 (rare) to Q5 (frequent) is robust across all models, methods, and sparsity levels.
  • ...and 1 more figure
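
Figures 1, 2, and 4 report mutual-nearest-neighbor (MNN) match rates at $\tau = 0.7$, and Figure 3 reports FVU. This excerpt does not spell out either computation, so the sketch below gives one common reading: two SAE features are matched when their decoder directions are each other's nearest neighbor by cosine similarity and that similarity clears $\tau$, and FVU is the residual variance of the SAE reconstruction divided by the total variance of the activations. The function names and the use of decoder directions are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def mnn_match_rate(dec_a: torch.Tensor, dec_b: torch.Tensor, tau: float = 0.7) -> float:
    """Fraction of features in SAE A with a mutual nearest neighbor in SAE B whose
    cosine similarity is at least tau. dec_a, dec_b: (n_features, d_model) decoder rows."""
    a = F.normalize(dec_a, dim=1)
    b = F.normalize(dec_b, dim=1)
    sim = a @ b.T                                   # pairwise cosine similarities
    best_b = sim.argmax(dim=1)                      # nearest neighbor in B for each A feature
    best_a = sim.argmax(dim=0)                      # nearest neighbor in A for each B feature
    idx = torch.arange(a.shape[0])
    mutual = best_a[best_b] == idx                  # A -> B -> back to the same A feature
    strong = sim[idx, best_b] >= tau                # similarity clears the matching threshold
    return (mutual & strong).float().mean().item()

def fvu(acts: torch.Tensor, recon: torch.Tensor) -> float:
    """Fraction of variance unexplained by the SAE reconstruction (lower is better)."""
    resid = (acts - recon).pow(2).sum()
    total = (acts - acts.mean(dim=0)).pow(2).sum()
    return (resid / total).item()
```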