Learning Towards Emergence: Paving the Way to Induce Emergence by Inhibiting Monosemantic Neurons on Pre-trained Models

Jiachuan Wang, Shimin Di, Tianhao Tang, Haoyang LI, Charles Wang-wai Ng, Xiaofang Zhou, Lei Chen

TL;DR

The paper studies emergence in large-scale pretraining and aims to induce it by suppressing monosemantic neurons. It extends prior work with Learning Towards Emergence (L2E), which introduces a moving-threshold neuron retrieval mechanism, a False Killing Rate metric to quantify inhibition side effects, and a regularization-style inhibition loss, enabling monosemantic neuron suppression during the pretraining of large-scale models. A Monosemanticity Score (MS) is validated as a scalable diagnostic across scales and layers, and experiments on Pythia models (70M, 410M, 2.8B) indicate that L2E improves performance on several reasoning tasks while remaining efficient. Overall, the work provides a practical framework for moderating monosemanticity to support emergent capabilities in large language models and outlines directions for broader validation and further study of emergence mechanisms.

Abstract

Emergence, the phenomenon of a rapid performance increase once the model scale reaches a threshold, has attracted widespread attention recently. Prior work has observed that monosemantic neurons in neural networks gradually diminish as the model scale increases. Building on this observation, Learning From Emergence was proposed to actively inhibit monosemantic neurons in relatively small neural networks (e.g., BERT and Swin-Transformer) to improve model performance during fine-tuning. However, to ultimately achieve emergence, it is necessary to support monosemantic neuron inhibition in the pretraining phase of large-scale models. This work therefore pushes the research direction further with Learning Towards Emergence (L2E), which makes it possible to train with, and validate the impact of, monosemantic neuron inhibition on larger pre-trained neural networks (e.g., Pythia-70M, 410M, and 2.8B). Specifically, to bridge the gap in current research, we first conduct experiments on models of various scales (up to 6.9B) to validate the ideas behind monosemanticity. We then present a novel method, L2E, to address the inefficient monosemantic neuron retrieval and ineffective monosemantic neuron inhibition that arise when existing methods are applied in the pretraining phase of large-scale models. L2E employs an adjustable thresholding technique for efficient neuron retrieval, incorporates a False Killing Rate metric to assess inhibition effects, and proposes a regularization-style inhibition approach, addressing the limitations of previous approaches in both efficiency and effectiveness. Experimental results demonstrate the effectiveness of L2E's monosemantic neuron inhibition and the efficiency of its implementation with large-scale models.
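
To make the abstract's ingredients more concrete, the following is a minimal sketch of how a moving threshold and a regularization-style penalty could fit together in one training step. It is not the paper's algorithm: the abstract does not give the score definition, the threshold update rule, or the exact loss form. The function names, the multiplicative update, the `step` size, and the `weight` coefficient are all assumptions made for illustration.

```python
from typing import List, Tuple


def retrieve_above_threshold(scores: List[float], threshold: float) -> List[int]:
    """Return indices of neurons whose score exceeds the current threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def update_threshold(threshold: float, n_selected: int, target: int,
                     step: float = 0.1) -> float:
    """Nudge the threshold so the number of retrieved neurons tracks `target`.

    Hypothetical rule (assumes a positive threshold): raise it when too many
    neurons were selected, lower it when too few were. A single comparison
    pass replaces a full sort or top-k over all neurons.
    """
    if n_selected > target:
        return threshold * (1.0 + step)
    if n_selected < target:
        return threshold * (1.0 - step)
    return threshold


def inhibition_penalty(scores: List[float], selected: List[int],
                       weight: float = 0.01) -> float:
    """Regularization-style term: weighted sum of the selected neurons' scores."""
    return weight * sum(scores[i] for i in selected)


def training_step(task_loss: float, scores: List[float],
                  threshold: float, target: int) -> Tuple[float, float]:
    """Combine the task loss with the penalty and move the threshold."""
    selected = retrieve_above_threshold(scores, threshold)
    total_loss = task_loss + inhibition_penalty(scores, selected)
    new_threshold = update_threshold(threshold, len(selected), target)
    return total_loss, new_threshold


if __name__ == "__main__":
    # Made-up per-neuron scores, purely for illustration.
    loss, thr = training_step(1.0, [0.5, 3.0, 10.0, 0.1], threshold=2.0, target=1)
    print(loss, thr)
```

In this sketch the penalty is simply added to the task loss, so inhibition acts through the gradient like any other regularizer, and the threshold drifts between steps instead of being recomputed from scratch.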

Paper Structure

This paper contains 31 sections, 9 equations, 13 figures, 9 tables.

Figures (13)

  • Figure 1: A demonstration of the concept "monosemantic". The left figure shows the output statistics of a monosemantic neuron, which is activated only by the feature "Python". This contrasts with a randomly selected neuron in the right figure. We use sparse probing [sparse] on Pythia models [pythia] to detect monosemantic neurons.
  • Figure 2: A monosemantic neuron with a negative mean difference. The average value of the neuron is also much larger than 0.
  • Figure 3: Validation of the effectiveness of MS. We probe neurons in Pythia models [pythia] based on the feature datasets Code Language (a) and Data Subset (b) [sparse].
  • Figure 4: K-S test for the monosemanticity levels across model scales on 3 feature datasets.
  • Figure 5: Statistics of MS across scales and layers. The results are obtained on the Natural Language feature dataset [sparse]. Larger models are drawn in deeper colors; their scores are clearly smaller, indicating lower monosemanticity.
  • ...and 8 more figures
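
Figure 4 refers to a K-S (Kolmogorov-Smirnov) test comparing monosemanticity levels across model scales. As a rough illustration of what such a comparison measures, the sketch below computes the two-sample K-S statistic, i.e. the largest gap between the empirical distribution functions of two samples. It computes only the statistic, not a p-value. The function name and the sample values are hypothetical and are not taken from the paper; in practice a library routine such as `scipy.stats.ks_2samp` would normally be used instead.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(a: Sequence[float], b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between the two ECDFs."""
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    sa, sb = sorted(a), sorted(b)
    d = 0.0
    # Both ECDFs are right-continuous step functions that only change at
    # sample points, so checking every sample point finds the supremum.
    for x in sa + sb:
        fa = bisect_right(sa, x) / len(sa)
        fb = bisect_right(sb, x) / len(sb)
        d = max(d, abs(fa - fb))
    return d


if __name__ == "__main__":
    # Made-up scores standing in for two model scales.
    small_model_scores = [1.0, 2.0, 3.0, 4.0]
    large_model_scores = [3.0, 4.0, 5.0, 6.0]
    print(ks_statistic(small_model_scores, large_model_scores))
```

A statistic near 0 means the two score distributions are nearly indistinguishable, while a value near 1 means they barely overlap.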