Towards Generating Informative Textual Description for Neurons in Language Models

Shrayani Mondal; Rishabh Garodia; Arbaaz Qureshi; Taesung Lee; Youngja Park

TL;DR

The paper addresses the challenge of understanding neuron-level information in language models by introducing an unsupervised framework that leverages generative LLMs to discover data-driven textual descriptors and maps them to neurons in a BERT model. It jointly uses descriptor generation, clustering, and exemplar-based neuron analysis to produce data-specific, interpretable neuron descriptors with minimal human input, demonstrated on the AMZN reviews dataset with BERT-base-uncased. The approach achieves precision@2 of 75% and recall@2 of 50% for neuron-descriptor tagging and shows high descriptor consistency (Jaccard around 0.95) across calibration and validation sets, indicating robustness against spurious mappings. This framework is scalable, data-driven, and applicable to other text models and datasets, with potential to improve interpretability, bias detection, and regulatory compliance in NLP systems.
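The descriptor consistency reported above is a Jaccard score, i.e. the overlap between two sets divided by their union. A minimal sketch of how such a score could be computed for one neuron follows; the function name and the example descriptor sets are hypothetical, and the summary does not say how the paper aggregates scores across neurons.

```python
def jaccard(calibration_descriptors, validation_descriptors):
    """Jaccard similarity between two descriptor sets for one neuron."""
    a = set(calibration_descriptors)
    b = set(validation_descriptors)
    if not a and not b:
        # Assumption: two empty sets are treated as fully consistent.
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical descriptors assigned to the same neuron on two data splits.
calib = {"price", "quality"}
valid = {"price", "quality", "shipping"}
score = jaccard(calib, valid)  # 2 shared descriptors out of 3 distinct ones
```

A score near 1 means the neuron receives almost the same descriptors on both splits, which is the sense in which a value around 0.95 suggests the mappings are not spurious.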

Abstract

Recent developments in transformer-based language models have allowed them to capture a wide variety of world knowledge that can be adapted to downstream tasks with limited resources. However, it is unclear what information these models encode, and the contribution of individual neurons to encoding it is largely unknown. Conventional approaches to neuron explainability either depend on a finite set of pre-defined descriptors or require manual annotations to train a secondary model that then explains the neurons of the primary model. In this paper, taking BERT as an example, we remove these constraints and propose a novel and scalable framework that ties textual descriptions to neurons. We leverage generative language models to discover human-interpretable descriptors present in a dataset and use an unsupervised approach to explain neurons with these descriptors. Through qualitative and quantitative analyses, we demonstrate the effectiveness of this framework in generating useful data-specific descriptors with little human involvement in identifying the neurons that encode these descriptors. In particular, our experiment shows that the proposed approach achieves 75% precision@2 and 50% recall@2.
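For readers unfamiliar with the metrics, precision@K and recall@K can be sketched as below under the standard information-retrieval definitions. This is an illustration only: the function name, the example descriptors, and the assumption that each neuron has a ranked list of predicted descriptors and a set of reference descriptors are not taken from the paper, whose exact evaluation protocol is not given here.

```python
def precision_recall_at_k(ranked_descriptors, reference_descriptors, k):
    """Return (precision@k, recall@k) for one neuron.

    ranked_descriptors: predicted descriptors, best first (assumed input).
    reference_descriptors: set of descriptors considered correct (assumed input).
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = ranked_descriptors[:k]
    hits = sum(1 for d in top_k if d in reference_descriptors)
    precision = hits / k
    # Assumption: recall is 0 when there are no reference descriptors.
    recall = hits / len(reference_descriptors) if reference_descriptors else 0.0
    return precision, recall


# Hypothetical example: one of the top-2 predictions is correct,
# and the neuron has three reference descriptors in total.
ranked = ["price", "quality", "shipping"]
reference = {"price", "shipping", "packaging"}
p_at_2, r_at_2 = precision_recall_at_k(ranked, reference, k=2)
```

Averaging these per-neuron values over all evaluated neurons would give dataset-level figures comparable in form to those quoted in the abstract.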

Paper Structure (20 sections, 2 equations, 9 figures, 5 tables)

Introduction
Related Work
Approach
Identifying Candidate Descriptors
Obtaining Descriptors for Sentences
Explaining Neurons with Descriptors
Experimental Setup
Models and Parameters
Dataset
Sentence Annotation with Descriptors
Results and Analysis
Neuron Descriptor Evaluation
Relations Among Descriptors
Neuron Descriptor Consistency
Limitations
...and 5 more sections

Figures (9)

Figure 1: The procedure to generate candidate descriptors. The descriptors can be generated for a large dataset using generative LLMs. They are clustered to reduce different expressions referring to the same meaning.
Figure 2: Proposed process flow for generating descriptors for neurons in LLMs being used for any downstream task.
Figure 3: A prompt template for identifying candidate descriptors. It is made up of the task (yellow), 1-shot example (green) and an input sentence in question (blue).
Figure 4: A prompt template for obtaining descriptors for sentences. It is made up of the task (yellow) and an input sentence $d_i$ (blue).
Figure 5: Precision and Recall. The shade shows standard deviation. Top Left: Precision vs Composition Threshold, Top Right: Recall vs Composition Threshold, Bottom Left: Average Precision@K vs K, Bottom Right: Average Recall@K vs K.
...and 4 more figures
