WASD: Locating Critical Neurons as Sufficient Conditions for Explaining and Controlling LLM Behavior

Haonan Yu; Junhao Liu; Zhenyu Yan; Haoran Lin; Xin Zhang

Abstract

Precise behavioral control of large language models (LLMs) is critical for complex applications. However, existing methods often incur high training costs, lack natural language controllability, or compromise semantic coherence. To bridge this gap, we propose WASD (unWeaving Actionable Sufficient Directives), a novel framework that explains model behavior by identifying sufficient neural conditions for token generation. Our method represents candidate conditions as neuron-activation predicates and iteratively searches for a minimal set that guarantees the current output under input perturbations. Experiments on SST-2 and CounterFact with the Gemma-2-2B model demonstrate that our approach produces explanations that are more stable, accurate, and concise than conventional attribution graphs. Moreover, a case study on cross-lingual output generation validates the practical effectiveness of WASD in controlling model behavior.

Paper Structure (30 sections, 7 equations, 2 figures, 2 tables, 2 algorithms)

Introduction
Background
Architecture of Transformer
Circuit Tracer and Attribution Graphs
Sufficient Conditions for Model Behaviors
Methodology
Generating Predicates
Identifying Sufficient Conditions
Experiment
Experimental Setup
Evaluation Metrics
Precision (Fidelity).
Instability.
Size.
Experiment Results and Analysis
...and 15 more sections

Figures (2)

Figure 1: Comparison of language control capabilities between WASD and Circuit Tracer. In each subplot, the left panel displays the input, and the right panel displays the model's output following intervention. WASD controls the output by fixing specific neurons associated with the target language. In contrast, Circuit Tracer fixes the five highest-scoring neurons derived from the target language prompt's contribution map.
Figure 2: The workflow of WASD.

Theorems & Definitions (7)

Definition 2.1: Prompt
Definition 2.2: Weight Extraction
Definition 2.3: Predicate
Definition 2.4: Rule
Definition 2.5: Neighborhood
Definition 2.6: Precision
Definition 2.7: Sufficiency
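
The abstract describes the core idea as representing candidate conditions as neuron-activation predicates and iteratively searching for a minimal set that preserves the current output under input perturbations. The formal definitions are only listed by name here, so the following is a minimal illustrative sketch under stated assumptions rather than the paper's algorithm: the toy two-layer "model", the clamp-to-original-activation predicate, the uniform random neighborhood, the threshold `tau`, and the greedy search are all hypothetical stand-ins chosen for illustration.

```python
import random

# Hypothetical toy model (an assumption, not Gemma-2-2B): four ReLU "neurons"
# feeding a binary output.
W_IN = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]
W_OUT = [3.0, 0.2, 0.1, 0.1]
BIAS = -1.5

def activations(x):
    """Hidden activations of the toy model for input vector x."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W_IN]

def output(acts):
    """Binary output computed from hidden activations."""
    return int(sum(w * a for w, a in zip(W_OUT, acts)) + BIAS > 0)

def precision(rule, x, neighbors):
    """Fraction of perturbed inputs whose output matches the original output
    when every neuron in `rule` is clamped to its activation on x."""
    base_acts = activations(x)
    target = output(base_acts)
    hits = 0
    for z in neighbors:
        acts = activations(z)
        for i in rule:
            acts[i] = base_acts[i]  # predicate: neuron i keeps its original value
        hits += output(acts) == target
    return hits / len(neighbors)

def greedy_sufficient_rule(x, neighbors, tau=0.95):
    """Greedily add the neuron that most improves precision until the
    (assumed) sufficiency threshold tau is reached."""
    rule = []
    candidates = list(range(len(W_IN)))
    while precision(rule, x, neighbors) < tau and candidates:
        best = max(candidates, key=lambda i: precision(rule + [i], x, neighbors))
        rule.append(best)
        candidates.remove(best)
    return rule

rng = random.Random(0)
x = [1.0, 0.5, 0.5]
# Assumed neighborhood: inputs sampled uniformly from the unit cube.
neighbors = [[rng.uniform(0.0, 1.0) for _ in x] for _ in range(200)]
rule = greedy_sufficient_rule(x, neighbors)
print("rule:", rule, "precision:", precision(rule, x, neighbors))
```

In this toy setting a single heavily weighted neuron is enough to keep the output fixed across the sampled neighborhood, which mirrors the intuition that a small set of critical neurons can act as a sufficient condition.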
