Do Sparse Subnetworks Exhibit Cognitively Aligned Attention? Effects of Pruning on Saliency Map Fidelity, Sparsity, and Concept Coherence
Sanish Suwal, Dipkamal Bhusal, Michael Clifford, Nidhi Rastogi
TL;DR
This work investigates how magnitude-based pruning with iterative fine-tuning affects interpretability in a ResNet-18 trained on ImageNette. The authors quantify interpretability using post-hoc explanations (Vanilla Gradients and Integrated Gradients) and fidelity metrics (ROAD MoRF and AOPC), alongside concept-level analysis with CRAFT, under a global pruning regime with binary mask $M$ and surviving weights $\theta \odot M$. They find that light-to-moderate pruning sharpens saliency maps and preserves semantically coherent concepts, while aggressive pruning merges heterogeneous features and degrades concept coherence even though predictive accuracy is maintained. These results imply that pruning can steer a model toward human-aligned attention patterns when sparsity is chosen carefully, informing principled pruning strategies where interpretability is important.
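As a rough illustration of the global pruning regime mentioned above, the following minimal sketch builds a binary mask $M$ by thresholding weight magnitudes across all layers at once and then forms the surviving weights $\theta \odot M$. It is not the authors' implementation: the function name `global_magnitude_masks`, the toy layer shapes, and the 40% sparsity level are assumptions made for illustration, and the fine-tuning step between pruning rounds is omitted.

```python
import numpy as np

def global_magnitude_masks(weights, sparsity):
    """Build binary masks that drop the smallest-magnitude weights globally.

    weights  : dict mapping a (hypothetical) layer name to a NumPy array
    sparsity : fraction of all weights to remove, in [0, 1)
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must lie in [0, 1)")
    # Pool magnitudes from every layer so one threshold is shared globally.
    all_mags = np.concatenate([np.abs(w).ravel() for w in weights.values()])
    k = int(round(sparsity * all_mags.size))  # number of weights to remove
    if k == 0:
        return {name: np.ones_like(w) for name, w in weights.items()}
    # The k-th smallest magnitude acts as the pruning threshold.
    # Ties at the threshold are also pruned (rare for real-valued weights).
    threshold = np.partition(all_mags, k - 1)[k - 1]
    return {name: (np.abs(w) > threshold).astype(w.dtype)
            for name, w in weights.items()}

# Toy "network" with two layers (shapes are illustrative assumptions).
rng = np.random.default_rng(0)
weights = {
    "layer1": rng.normal(size=(4, 5)),
    "layer2": rng.normal(size=(10, 3)),
}
masks = global_magnitude_masks(weights, sparsity=0.4)
# Surviving weights: element-wise product of theta and M.
pruned = {name: weights[name] * masks[name] for name in weights}
kept = int(sum(m.sum() for m in masks.values()))
```

Because the threshold is shared, layers with many small weights lose a larger fraction of their parameters than layers with large weights, which is the main difference from layer-wise pruning.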
Abstract
Prior work has shown that neural networks can be heavily pruned while preserving performance, but the impact of pruning on model interpretability remains unclear. In this work, we investigate how magnitude-based pruning followed by fine-tuning affects both low-level saliency maps and high-level concept representations. Using a ResNet-18 trained on ImageNette, we compare post-hoc explanations from Vanilla Gradients (VG) and Integrated Gradients (IG) across pruning levels, evaluating sparsity and faithfulness. We further apply CRAFT-based concept extraction to track changes in the semantic coherence of learned concepts. Our results show that light-to-moderate pruning improves saliency-map focus and faithfulness while retaining distinct, semantically meaningful concepts. In contrast, aggressive pruning merges heterogeneous features, reducing saliency-map sparsity and concept coherence despite maintaining accuracy. These findings suggest that while pruning can shape internal representations toward more human-aligned attention patterns, excessive pruning undermines interpretability.
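To make the two attribution methods named in the abstract concrete, the sketch below computes Vanilla Gradients and Integrated Gradients for a toy logistic model with an analytic gradient. This is only an illustration under stated assumptions: the model, its weights `w` and bias `b`, the all-zeros baseline, and the midpoint Riemann approximation with 200 steps are illustrative choices, not the paper's ResNet-18 setup.

```python
import numpy as np

def model(x, w, b):
    """Toy differentiable model: sigmoid of a linear score."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

def model_grad(x, w, b):
    """Analytic gradient of the model output with respect to the input x."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def vanilla_gradients(x, w, b):
    """VG: the input gradient evaluated at x itself."""
    return model_grad(x, w, b)

def integrated_gradients(x, baseline, w, b, steps=200):
    """IG: average gradients along the straight path baseline -> x,
    then scale by (x - baseline). Midpoint Riemann approximation."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.stack([model_grad(baseline + a * (x - baseline), w, b)
                      for a in alphas])
    return (x - baseline) * grads.mean(axis=0)

# Illustrative parameters and input (assumptions, not from the paper).
w = np.array([2.0, -1.0, 0.5])
b = 0.1
x = np.array([1.0, 2.0, -1.0])
baseline = np.zeros_like(x)

vg = vanilla_gradients(x, w, b)
ig = integrated_gradients(x, baseline, w, b)
# Completeness property of IG: attributions sum to f(x) - f(baseline).
completeness_gap = abs(ig.sum() - (model(x, w, b) - model(baseline, w, b)))
```

The completeness check at the end is a useful sanity test for any IG implementation: if the gap is not close to zero, the number of integration steps is usually too small.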
