Discovering Salient Neurons in Deep NLP Models

Nadir Durrani, Fahim Dalvi, Hassan Sajjad

TL;DR

The paper tackles the challenge of interpreting deep NLP models at the neuron level by introducing Linguistic Correlation Analysis (LCA), a three-step method that extracts neuron activations, trains a linear probe on them for an extrinsic linguistic property, and ranks neurons by the probe's weights. It systematically examines how these salient neurons are distributed across layers, how many are needed to preserve task performance, and how transfer learning and multilingual training affect neuron saliency. Key findings include lower-layer localization for lexical properties, middle- and higher-layer emphasis for syntax, widespread redundancy (many different small subsets of neurons can support the same task), and architecture-dependent shifts in where linguistic neurons sit after fine-tuning. The work has practical implications for pruning, targeted transfer learning, and interpretable model analysis, and the code is publicly available as part of the NeuroX toolkit.

Abstract

While a lot of work has been done on understanding the representations learned within deep NLP models and the knowledge they capture, little attention has been paid to individual neurons. We present a technique called Linguistic Correlation Analysis to extract salient neurons in the model with respect to any extrinsic property, with the goal of understanding how such knowledge is preserved within neurons. We carry out a fine-grained analysis to answer the following questions: (i) can we identify subsets of neurons in the network that capture specific linguistic properties? (ii) how localized or distributed are these neurons across the network? (iii) how redundantly is the information preserved? (iv) how does fine-tuning pre-trained models towards downstream NLP tasks impact the learned linguistic knowledge? (v) how do architectures vary in learning different linguistic properties? Our data-driven, quantitative analysis yields several findings: (i) small subsets of neurons can predict different linguistic tasks; (ii) neurons capturing basic lexical information (such as suffixation) are localized in the lowermost layers; (iii) neurons learning complex concepts (such as syntactic role) are found predominantly in the middle and higher layers; (iv) salient linguistic neurons are relocated from higher to lower layers during transfer learning, as the network preserves the higher layers for task-specific information; (v) pre-trained models differ in how linguistic information is preserved within them; and (vi) concepts exhibit similar neuron distributions across different languages in multilingual transformer models. Our code is publicly available as part of the NeuroX toolkit.
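
To make the idea behind Linguistic Correlation Analysis concrete, the following is a minimal sketch of the probe-then-rank recipe on synthetic data. It is not the authors' implementation and does not use the NeuroX API. The toy activations, the helper names (`make_toy_activations`, `train_linear_probe`, `rank_neurons`), the plain L2 penalty, and the "sum of absolute weights" saliency score are all assumptions chosen for illustration.

```python
import numpy as np


def make_toy_activations(n=600, d=20, informative=(3, 11), seed=0):
    """Hypothetical stand-in for extracted activations: d neurons, binary property.

    Only the neurons listed in `informative` carry signal about the label.
    """
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, d))
    for j in informative:
        X[:, j] += 3.0 * (2 * y - 1)  # shift the neuron by +3 or -3 depending on the label
    return X, y


def train_linear_probe(X, y, n_classes, lr=0.5, epochs=300, l2=1e-3):
    """Fit a softmax (multinomial logistic) probe with full-batch gradient descent."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot labels
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        P = np.exp(logits)
        P /= P.sum(axis=1, keepdims=True)
        G = (P - Y) / n  # gradient of the mean cross-entropy w.r.t. the logits
        W -= lr * (X.T @ G + l2 * W)  # L2 penalty is an assumption of this sketch
        b -= lr * G.sum(axis=0)
    return W, b


def rank_neurons(W):
    """Order neurons by total absolute probe weight, most salient first."""
    saliency = np.abs(W).sum(axis=1)
    return np.argsort(-saliency)


X, y = make_toy_activations()
W, b = train_linear_probe(X, y, n_classes=2)
ranking = rank_neurons(W)
accuracy = float(((X @ W + b).argmax(axis=1) == y).mean())

print("most salient neurons:", ranking[:5].tolist())
print("probe accuracy on the toy data:", accuracy)
```

In this toy setting the two planted neurons should surface at the top of the ranking. With real model activations, the same ranking could then be used to keep only the top-ranked neurons and check how much probe accuracy survives.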

Paper Structure

This paper contains 23 sections, 2 equations, 14 figures, 9 tables, 4 algorithms.

Figures (14)

  • Figure 1: Linguistic Correlation Analysis: Extract neuron activations from a trained model, train a classifier and use weights of the classifier to extract salient neurons.
  • Figure 2: Syntactic relations according to the Universal Dependencies formalism. Here "Musharraf" and "Vajpayee" are the subject and object of "meets", respectively, obl refers to an oblique relation of the locative modifier, nmod denotes the genitive relation, the prepositions "in" and "of" are treated as case-marking elements, and "the" is a determiner. See https://universaldependencies.org/guidelines.html for detailed definitions.
  • Figure 3: Visualizations (POS) -- Neuron 193 in Layer activates positively for superlative adjectives, Neuron 750 activates positively for gerund verbs.
  • Figure 4: Visualizations (SEM) -- Neuron 651 in layer 2 activates negatively for person names, Neuron 115 is a place neuron.
  • Figure 5: Position Neuron: Activates positively in the beginning, becomes neutral in the middle and negatively towards the end of sentence.
  • ...and 9 more figures