Learning from Emergence: A Study on Proactively Inhibiting the Monosemantic Neurons of Artificial Neural Networks

Jiachuan Wang; Shimin Di; Lei Chen; Charles Wang Wai Ng

TL;DR

This work addresses the problem that monosemantic neurons may limit performance as neural networks scale. It introduces MEmeL, a lightweight plug-in module that combines an online Monosemantic Scale (MS) metric $\phi$ with a Reverse Deactivation (RD) strategy to proactively suppress monosemanticity and promote polysemantic representations, without adding trainable parameters. Empirically, MEmeL and its tuned variant MEmeL-Tune achieve competitive or superior results across language (GLUE with BERT), vision (ImageNet with Swin-Transformer), and physics (ConvGRU on HKO-7) tasks, outperforming naive deactivation and matching or exceeding baseline performance with statistical significance. The authors show that pretraining with MEmeL can yield larger gains than finetuning, albeit at higher computational cost, and outline future directions for applying emergence-based inhibition to very large models and broader domains.

Abstract

Recently, emergence has received widespread attention from the research community alongside the success of large-scale models. Departing from the literature, we hypothesize a key factor that improves performance as scale increases: the reduction of monosemantic neurons, which can only form one-to-one correlations with specific features. Monosemantic neurons tend to be sparser and to hurt performance in large models. Inspired by this insight, we propose an intuitive idea: identify monosemantic neurons and inhibit them. However, achieving this goal is non-trivial, because there is no unified quantitative metric for monosemanticity, and simply banning monosemantic neurons does not promote polysemanticity in neural networks. Therefore, we first propose a new metric that measures the monosemanticity of neurons and is efficient enough for online computation, and then introduce a theoretically supported method that suppresses monosemantic neurons and proactively increases the proportion of polysemantic neurons during training. We validate our conjecture that monosemanticity brings about performance changes at different model scales on a variety of neural networks and benchmark datasets in different domains, including language, image, and physics simulation tasks. Further experiments validate our analysis and theory regarding the inhibition of monosemanticity.

Paper Structure (46 sections, 2 theorems, 31 equations, 5 figures, 9 tables, 2 algorithms)

Introduction
Preliminary
Activation and Monosemantic
Monosemanticity Inhibition
Methods
Metric for Monosemanticity
Inhibition of Monosemanticity
Naive Deactivation
Reversed Deactivation
Flexible Plug-in Module
Experiments
Experimental Setup
Data Sets, Base Models, and Tasks
Hyper-parameter Setting
Implementation
...and 31 more sections

Key Result

lemma 1

Denote by $\mu_m$ the sample mean $\Bar{z}$ of $m$ samples and by $\upsilon_m$ their sample variance $S^2$. When the $(m+1)^{th}$ to $(m+b)^{th}$ samples $\{z_{[m+1]}, \cdots, z_{[m+b]}\}$ arrive, the updated values can be obtained via:

$$\mu_{m+b}=\frac{m\,\mu_m+b\,\mu'_b}{m+b},\qquad \upsilon_{m+b}=\frac{m\,\upsilon_m+b\,\upsilon'_b}{m+b}+\frac{m\,b\,(\mu_m-\mu'_b)^2}{(m+b)^2},$$

where $\mu_b'=\frac{\sum_{i=1}^bz_{[m+i]}}{b}$ and $\upsilon'_b=\frac{\sum_{i=1}^b(z_{[m+i]}-\mu'_b)^2}{b}$. The update is of $O(1)$ time and memory complexity, independent of $m$.
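The lemma's incremental update can be illustrated with the standard rule for merging the mean and variance of two groups of samples. The sketch below is a minimal Python version. It assumes that the running variance, like $\upsilon'_b$, divides by the number of samples (population variance). The function names `batch_stats` and `merge_stats` are illustrative and are not taken from the paper.

```python
from statistics import fmean, pvariance


def batch_stats(batch):
    """Mean and population variance (divide by b) of one incoming batch."""
    b = len(batch)
    mu_b = sum(batch) / b
    var_b = sum((z - mu_b) ** 2 for z in batch) / b
    return mu_b, var_b


def merge_stats(m, mu_m, var_m, batch):
    """Fold a new batch into running stats computed over m earlier samples.

    Assumption: var_m is a population variance (divided by m), matching the
    convention used for the batch variance. Only (m, mu_m, var_m) are kept,
    so memory does not grow with the number of samples seen.
    """
    b = len(batch)
    mu_b, var_b = batch_stats(batch)
    n = m + b
    mu = (m * mu_m + b * mu_b) / n
    # Within-group spread plus a correction for the gap between group means.
    var = (m * var_m + b * var_b) / n + m * b * (mu_m - mu_b) ** 2 / n ** 2
    return n, mu, var


# Stream one neuron's activations in batches and compare with a full pass.
data = [1, 2, 3, 4, 5, 6, 7, 8]
n, mu, var = 0, 0.0, 0.0
for batch in (data[0:3], data[3:6], data[6:8]):
    n, mu, var = merge_stats(n, mu, var, batch)
print(n, mu, var, fmean(data), pvariance(data))
```

Only $(m, \mu_m, \upsilon_m)$ are carried between batches, so the per-neuron state stays constant in size however many batches are processed. Printing `fmean` and `pvariance` over the full stream lets you check that the merged statistics agree up to floating-point rounding.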

Figures (5)

Figure 1: Demonstration of important concepts with statistics: (a) A monosemantic neuron (orange) ideally activates for one specific type of feature. (b) A polysemantic neuron (green) activates for multiple features. (c) The output values of a monosemantic neuron given different input features. Its related feature (French) produces values that stand out significantly from the other features. (d) The output values of an arbitrarily selected neuron (layer 3, number 333) given different features. The values fluctuate slightly with similar patterns. These statistics are obtained by inspecting the Pythia-v0 410M model [pythia].
Figure 2: We detect the monosemantic neurons for "French" following the sparse probing paper [sparseprobe] and run the experiments on Pythia-v0 [pythia]. After deactivating a monosemantic neuron for "French", the loss increases on inputs with different language features (e.g., Dutch and Greek) for Pythia models of different scales: (a) the 70M Pythia-v0 model, (b) the 1B Pythia-v0 model, and (c) the 6.9B Pythia-v0 model. These neurons are typically monosemantic, causing a large increase in loss only when the input contains "French" (see green arrow). However, for larger models, deactivating these neurons leads to a smaller increase in loss (see red arrow). This hints that monosemanticity may be negatively related to the scale and performance of larger models.
Figure 3: Illustration of problems and solutions to inhibit monosemanticity. (a) A monosemantic neuron $z$ only activates (orange) for the feature "cat", with a high mean value ($=7$). $z$ is deactivated (blue) for other inputs, with a small mean value ($\Bar{z}=1$). (b) The first goal is to optimize the preceding model $f_1$ so that $z$ is less activated given the input "cat". (c) The second goal is to optimize the following model $f_2$ so that a correct output for "cat" does not rely solely on $z$. (d) Zoom in on the original model at neuron $z$. (e) A naive solution that sets $z'$ to a constant 1 without gradient. (f) A naive solution that decreases the value of $z$ to $\Bar{z}$ by subtracting a constant 6 without gradient. (g) Reverse Deactivation, which first reverses $z$ and then pushes the output value to $\Bar{z}$ by adding a constant 8 without gradient. (h) All the methods achieve the second goal by outputting the value $\mathcal{V}(z')=\Bar{z}$ to $f_2$. As $\Bar{z}$ provides little information, $f_2$ must learn to rely on other neurons. (i) When calculating the gradient, $f_1$ finds that $z$ is too small and tends to increase it (e.g., from 1 to 2). (j) The first naive method (e) cannot update the related parameters because there is no gradient. (k) The second naive method (f) further increases the underlying activation $z$ (from 7 to 8). (l) Reverse Deactivation inherently deactivates $z$ (from 7 to 6). When a new batch arrives, the updated $z^*$ activates less ($=6$) for "cat" than $z$ did.
Figure 4: Overview of our method: (a) An arbitrary neural network framework. $\mathbf{x}$ represents the input and $\mathbf{o}$ represents the output. $\mathbf{z}$s represent hidden layers of neurons, and arrows indicate the dependency relationships. (b) Our module is inserted after $\mathbf{z}^{3}$ and $\mathbf{z}^{5}$, requiring no changes to the framework. (c) Details of our module applied to $\mathbf{z}^{5}$. The input neurons are first analyzed using our metric. Once monosemantic neurons are identified, they are inhibited using RD. The resulting processed layer has the same shape as the input.
Figure 5: An analogy example to demonstrate our motivation: (a) ANNs could be similar to biological brains. Small-scale networks (both biological and artificial) cannot support complex functionality. (b) To solve a difficult problem that requires complex reasoning, pupils rely on rote memorization, while ANNs store QA pairs using monosemantic neurons as key-value pairs. (c) Large-scale networks can learn and master solving skills related to integration. (d) Adults and large-scale ANNs are capable of decomposing features and inferring answers using complex neuron circuits. This ability reduces reliance on rote memory and monosemanticity.
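The numbers in Figure 3 can be reproduced with a toy scalar model, sketched below in Python. It assumes that "a constant without gradient" means a stop-gradient (detached) term, so Reverse Deactivation outputs $z' = -z + \mathrm{sg}(z + \Bar{z})$, while the naive variants output $\mathrm{sg}(\Bar{z})$ and $z - \mathrm{sg}(z - \Bar{z})$. It also treats $z$ itself as directly trainable under a unit downstream gradient asking $z'$ to grow; in the actual method $z$ changes through the parameters of $f_1$. The names `deactivate`, `sgd_step_on_z`, and the method labels are hypothetical.

```python
# Toy scalar model of Figure 3 (e)-(g) and (j)-(l).
# Assumption: "a constant without gradient" is a detached (stop-gradient) term,
# so it changes the forward value but contributes nothing to dz'/dz.


def deactivate(z, z_bar, method):
    """Return (forward value z', local derivative dz'/dz) for one neuron."""
    if method == "constant":  # Fig. 3(e): z' = sg(z_bar)
        return z_bar, 0.0
    if method == "shift":  # Fig. 3(f): z' = z - sg(z - z_bar)
        return z - (z - z_bar), 1.0
    if method == "reverse":  # Fig. 3(g), RD: z' = -z + sg(z + z_bar)
        return -z + (z + z_bar), -1.0
    raise ValueError(f"unknown method: {method}")


def sgd_step_on_z(z, z_bar, method, upstream_grad=-1.0, lr=1.0):
    """One gradient-descent step on z, treating z as directly trainable.

    upstream_grad is dL/dz'; a negative value means the downstream model f2
    wants z' to increase (Fig. 3(i)). In the real method, z changes through
    the parameters of f1 rather than directly.
    """
    _, local_grad = deactivate(z, z_bar, method)
    return z - lr * upstream_grad * local_grad


for method in ("constant", "shift", "reverse"):
    value, _ = deactivate(7, 1, method)
    print(method, value, sgd_step_on_z(7, 1, method))
```

All three variants send the same value $\Bar{z}=1$ to $f_2$, but the updated $z$ is 7.0, 8.0, and 6.0 respectively, mirroring panels (j), (k), and (l): only the reversed form pushes the underlying activation down. In an autograd framework, the stop-gradient term would typically be written with a detach call, for example `(z + z_bar).detach()` in PyTorch.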

Theorems & Definitions (5)

definition 1: Monosemantic Scale
lemma 1
lemma 2
proof
proof