Negative Pre-activations Differentiate Syntax

Linghao Kong, Angelina Ning, Micah Adler, Nir Shavit

Abstract

Modern large language models increasingly use smooth activation functions such as GELU or SiLU, allowing negative pre-activations to carry both signal and gradient. Nevertheless, many neuron-level interpretability analyses have historically focused on large positive activations, often implicitly treating the negative region as less informative, a carryover from the ReLU era. We challenge this assumption and ask whether and how negative pre-activations are leveraged by models. We address this question by studying a sparse subpopulation of Wasserstein neurons whose output distributions deviate strongly from a Gaussian baseline and that functionally differentiate similar inputs. We show that this negative region plays an active role rather than reflecting a mere gradient optimization side effect. A minimal, sign-specific intervention that zeroes only the negative pre-activations of a small set of Wasserstein neurons substantially increases perplexity and sharply degrades grammatical performance on BLiMP and TSE, whereas both random and perplexity-matched ablations of many more non-Wasserstein neurons in their negative pre-activations leave grammatical performance largely intact. Conversely, on a suite of non-grammatical benchmarks, the perplexity-matched control ablation is more damaging than the Wasserstein neuron ablation, yielding a double dissociation between syntax and other capabilities. Part-of-speech analysis localizes the excess surprisal to syntactic scaffolding tokens, layer-specific interventions show that small local degradations accumulate across depth, and training-dynamics analysis reveals that the same sign-specific ablation becomes more harmful as Wasserstein neurons emerge and stabilize. Together, these results identify negative pre-activations in a sparse subpopulation of Wasserstein neurons as an actively used substrate for syntax in smooth-activation language models.
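The sign-specific intervention described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `ablate_negative_preacts`, the `(tokens, neurons)` array layout, and the tanh approximation of GELU are all assumptions made for illustration. The only behavior taken from the abstract is that negative pre-activations of a chosen set of neurons are set to zero while everything else is left untouched.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU (an assumption; models may use the exact erf form).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def ablate_negative_preacts(pre_acts, neuron_ids):
    """Zero only the negative pre-activations of the selected neurons.

    pre_acts:   array of shape (tokens, neurons), values before the activation.
    neuron_ids: hypothetical list of column indices to intervene on.
    """
    out = pre_acts.copy()
    selected = out[:, neuron_ids]
    # Positive values pass through unchanged; negative values are clamped to 0.
    out[:, neuron_ids] = np.where(selected < 0.0, 0.0, selected)
    return out


rng = np.random.default_rng(0)
pre = rng.normal(size=(4, 6))          # toy pre-activations
ablated = ablate_negative_preacts(pre, neuron_ids=[1, 4])
post = gelu(ablated)                   # activation applied after the intervention
```

In a real model the same clamp would be applied to the MLP pre-activations during the forward pass, for example through a hook; that wiring depends on the framework and is omitted here.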

Paper Structure

This paper contains 24 sections, 16 figures, 5 tables.

Figures (16)

  • Figure 1: Wasserstein neurons in ReLU vs non-ReLU LLMs. (a, b) In OPT-1.3B, a ReLU-based model, the dominant pre-activation mass forms a roughly Gaussian peak whose mode lies below zero, with an additional mildly multimodal positive tail. (c, d) In Pythia 1.4B, a GELU-based model, the dominant mass instead centers near zero, and the negative pre-activation region exhibits more pronounced multimodality, reflecting preservation of negative inputs. (e) The input-output (IO) relationship of the Pythia Wasserstein neuron, showing that even fairly similar pairs of inputs are mapped far apart by the neuron. More details are provided in Section \ref{sec:WNdef}. Neurons were taken from the up projection in the second MLP block of their respective models.
  • Figure 2: Wasserstein neuron emergence tracks grammatical accuracy. (a) Larger models tend to have Wasserstein neurons with greater maximum WD, and more neurons with slightly greater WD. Dotted lines indicate mean WD, and text indicates the maximum WD of all neurons in the layer. (b-d) share the same legend. (b) Wasserstein neurons arise rapidly during training, within roughly 50B tokens. The WD of the same cohort of Wasserstein neurons is calculated at each checkpoint. (c) Wasserstein neurons tend to start and stop learning faster than other neurons, as measured by the cosine dissimilarity, normalized to the layer average, between successive 10K-step checkpoints. (d) At various checkpoints in training, the WD of the Wasserstein neuron group is compared to the model's performance on TSE at that time; the two correlate strongly. All neurons are from the up projection in each model. Shaded bands are one standard error of the mean.
  • Figure 3: Sign-specific perturbation of Wasserstein neurons disproportionately harms grammar. (a, b) share the same model colors. (a) Perplexity increases when clamping only the negative pre-activations of the top-WD neurons; increases for random controls are much smaller. Numbers indicate starting and ending absolute perplexity. (b) Matching the perplexity increase from perturbing the entangled fraction of neurons requires an order of magnitude more non-entangled units. (c) Perturbing Wasserstein neurons uniquely impacts grammatical capabilities, even compared to the perplexity-matched control. In each model, the top $1\%$ Wasserstein neurons in each layer were perturbed for the benchmark. The least entangled $40\%$ of neurons in Llama, $50\%$ in Mistral, and $20\%$ in Qwen in each layer were used as the perplexity-matched control. (d) At per-token resolution, tokens associated with syntactic scaffolding incur a much higher surprisal for the $1\%$ Wasserstein perturbation than for the perplexity-matched control in Llama 3.1 8B. NLL differences were mean-shifted by the global difference. Randomly sampled controls were acquired over ten trials. Error bars indicate one standard error of the mean. Raw scores and CIs are in Table \ref{tab:grammarbenchmarks}.
  • Figure 4: Non-grammatical abilities are comparatively less harmed by Wasserstein neuron perturbation. All benchmarks were run in 0-shot. Randomly sampled controls were acquired over ten trials. Error bars indicate one standard error of the mean. Raw scores and CIs are in Tables \ref{tab:grammarbenchmarks}, \ref{tab:otherbenchmarks}.
  • Figure 5: Individual and cumulative layerwise ablations. (a, b) share the same y-axis labels. (a) Early layer ablations yield the greatest increases in error, specifically within ellipsis and subject-verb agreement. (b) Error increases monotonically with cumulative ablation, with the strongest effects for ellipsis and subject-verb agreement. (c) TSE performance is also the most sensitive to early layer perturbation, especially for negative polarity item licensing. (d) Error for TSE grows monotonically as well. All benchmarks collected for $1\%$ Wasserstein perturbation per layer.
  • ...and 11 more figures
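
The captions above rank neurons by WD, and the abstract describes Wasserstein neurons as those whose output distributions deviate strongly from a Gaussian baseline. A minimal sketch of one way to score such a deviation follows. The specifics are assumptions, not the paper's stated procedure: the function name `wd_to_gaussian` is hypothetical, outputs are standardized to zero mean and unit variance, and the 1-Wasserstein distance to a standard normal is estimated by comparing sorted samples against Gaussian quantiles.

```python
from statistics import NormalDist

import numpy as np


def wd_to_gaussian(values):
    """Estimate the 1-Wasserstein distance between standardized samples and N(0, 1).

    Assumed recipe: standardize, sort, and compare each order statistic with the
    standard-normal quantile at the matching mid-point probability.
    """
    x = np.asarray(values, dtype=float)
    x = np.sort((x - x.mean()) / x.std())
    n = x.size
    probs = (np.arange(n) + 0.5) / n
    normal = NormalDist()  # standard normal, mean 0 and sigma 1
    quantiles = np.array([normal.inv_cdf(p) for p in probs])
    return float(np.mean(np.abs(x - quantiles)))


rng = np.random.default_rng(0)
gaussian_like = rng.normal(size=5000)
# Toy bimodal "neuron": two well-separated modes, far from Gaussian.
bimodal = np.concatenate([rng.normal(-3.0, 0.3, 2500), rng.normal(3.0, 0.3, 2500)])

wd_gaussian = wd_to_gaussian(gaussian_like)
wd_bimodal = wd_to_gaussian(bimodal)
```

Under this recipe the Gaussian-like sample scores close to zero and the bimodal sample scores much higher, which is the kind of ordering a top-WD selection would rely on.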