Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness

Qi Zhang, Yifei Wang, Jingyi Cui, Xiang Pan, Qi Lei, Stefanie Jegelka, Yisen Wang

TL;DR

This work challenges the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance, and explores the learning benefits of monosemanticity beyond interpretability.

Abstract

Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantics, resulting in unclear attributions of model behavior. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to compromise accuracy. In this work, we challenge the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance. Across multiple robust learning scenarios, including input and label noise, few-shot learning, and out-of-domain generalization, our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide empirical and theoretical insight into the robustness gains of feature monosemanticity. Our preliminary analysis suggests that monosemanticity, by promoting better separation of feature representations, leads to more robust decision boundaries. This diverse evidence highlights the generality of monosemanticity in improving model robustness. As a first step in this new direction, we explore the learning benefits of monosemanticity beyond interpretability, supporting the long-standing hypothesis linking interpretability and robustness. Code is available at https://github.com/PKU-ML/Beyond_Interpretability.

Paper Structure

This paper contains 33 sections, 11 theorems, 61 equations, 6 figures, 2 tables.

Key Result

Theorem 4.1

Let $\nu_{\mathrm{mono}}=x_1$ and $\nu_{\mathrm{poly}}=x_1-x_2$. For conditional means, we have $\mu_0(\nu_{\mathrm{poly}}) < \mu_0(\nu_{\mathrm{mono}})$ and $\mu_1(\nu_{\mathrm{poly}}) < \mu_1(\nu_{\mathrm{mono}})$, yet $\Delta \mu(\nu_{\mathrm{poly}}) > \Delta \mu(\nu_{\mathrm{mono}})$. For condit

Figures (6)

  • Figure 1: A comparison between polysemantic (CL) and monosemantic features (NCL, SAE) pretrained on ImageNet-100. We consider noisy labels (90% noise rate) and Gaussian input noise (standard deviation $0.6$); see the appendix for more details.
  • Figure 2: The evaluation of robustness against input distribution shifts on ImageNet-100. Monosemantic representations (SAE, NCL) exhibit improved robustness against different kinds of distribution shifts.
  • Figure 3: The robustness of the models finetuned with polysemanticity (CE) and monosemanticity (NCE) under different noises on ImageNet-100. Attaining monosemanticity during the finetuning process enhances the robustness across various tasks.
  • Figure 4: Influence of feature monosemanticity on classification performance, where the classifier is applied after a frozen contrastive encoder and trained with 90% noisy labels. (a) and (b) show the activated samples on the dimensions with the largest classifier weight for the lowest-accuracy and highest-accuracy classes on ImageNet-100, respectively. (c) shows the monosemanticity scores (wang2024non) of wrongly and correctly classified samples.
  • Figure 5: The comparison between polysemantic and monosemantic features on the toy model introduced by Elhage et al. (elhage2022toy) ($n=40$, $m=20$, $S=0.2$). (a) shows the parameters ($W^\top W$) of monosemantic (left) and polysemantic (right) features on the toy model. (b) evaluates the classification performance of the features under different noises. Label noise denotes applying 90% noisy labels to the training samples, and input noise denotes applying Gaussian noise to the validation samples.
  • ...and 1 more figure
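
To make the $W^\top W$ panels of Figure 5(a) concrete, the following is a minimal sketch of the toy-model setup with $n=40$ features and $m=20$ hidden dimensions. It is not the authors' code: the two weight matrices are hand-constructed stand-ins rather than trained parameters, the helper names are hypothetical, and reading $S=0.2$ as the probability that a feature is zero is an assumption.

```python
import numpy as np


def sample_sparse_features(num_samples, n, sparsity, rng):
    """Sample features that are zero with probability `sparsity`, else uniform in [0, 1].

    Interpreting S as this zero-probability is an assumption of the sketch.
    """
    values = rng.uniform(0.0, 1.0, size=(num_samples, n))
    active = rng.uniform(size=(num_samples, n)) >= sparsity
    return values * active


def toy_forward(x, W, b):
    """Toy-model reconstruction ReLU(W^T W x + b), with W of shape (m, n)."""
    hidden = x @ W.T                      # project n features into m dimensions
    return np.maximum(hidden @ W + b, 0.0)


def mean_abs_offdiag(W):
    """Mean absolute off-diagonal entry of W^T W (0 means no feature interference)."""
    gram = W.T @ W
    off_diagonal = gram[~np.eye(gram.shape[0], dtype=bool)]
    return float(np.abs(off_diagonal).mean())


n, m = 40, 20
rng = np.random.default_rng(0)

# Hand-built "monosemantic" weights: each hidden dimension carries exactly one
# feature and the remaining n - m features are dropped (illustrative only).
W_mono = np.hstack([np.eye(m), np.zeros((m, n - m))])

# Hand-built "polysemantic" weights: 40 random unit-norm directions packed into
# 20 dimensions, so different features necessarily overlap (illustrative only).
W_poly = rng.normal(size=(m, n))
W_poly /= np.linalg.norm(W_poly, axis=0, keepdims=True)

print("mono interference:", mean_abs_offdiag(W_mono))
print("poly interference:", mean_abs_offdiag(W_poly))
```

In this sketch the monosemantic $W^\top W$ is diagonal on the retained features, while the polysemantic one has nonzero off-diagonal entries, which is the qualitative contrast the two panels are described as showing.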

Theorems & Definitions (22)

  • Theorem 4.1: Conditional means and variances of monosemantic & polysemantic features
  • Theorem 4.2: Influence of label noise on linear separability
  • Theorem B.1: Conditional mean and variance of monosemantic representations
  • Proof: Proof of Theorem \ref{thm::muwo}
  • Lemma B.2: Distribution of $\nu_{\mathrm{poly}}=x_1-x_2$
  • Proof: Proof of Lemma \ref{lem::dist_nuw}
  • Theorem B.3: Conditional mean and variance of polysemantic representations
  • Proof: Proof of Theorem \ref{thm::muw}
  • Proof: Proof of Theorem \ref{thm::superposition}
  • Lemma B.4: Conditional Distributions
  • ...and 12 more