Polysemanticity or Polysemy? Lexical Identity Confounds Superposition Metrics

Iyad Ait Hou, Rebecca Hwa

Abstract

If the same neuron activates for both "lender" and "riverside," standard metrics attribute the overlap to superposition--the neuron must be compressing two unrelated concepts. This work explores how much of the overlap is instead due to a lexical confound: neurons fire for a shared word form (such as "bank") rather than for two compressed concepts. A 2x2 factorial decomposition reveals that the lexical-only condition (same word, different meaning) consistently exceeds the semantic-only condition (different word, same meaning) across models spanning 110M-70B parameters. The confound carries into sparse autoencoders (18-36% of features blend senses), sits in <=1% of activation dimensions, and hurts downstream tasks: filtering it out improves word sense disambiguation and makes knowledge edits more selective (p = 0.002).
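
To make the 2x2 design concrete, here is a minimal sketch of the comparison, assuming hypothetical data: the random vectors below stand in for model hidden-state activations, and the condition names follow the figure captions (PS = same word, different meaning; SYN = different word, same meaning).

```python
# Minimal sketch of the 2x2 factorial decomposition (illustrative only).
# In practice u and v would be hidden-state activations for the target
# word in two contexts; here synthetic vectors stand in for them.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two activation vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The four cells of the design, crossing word form with meaning:
#   ID   same word, same meaning        ("bank"/money vs. "bank"/money)
#   PS   same word, different meaning   ("bank"/money vs. "bank"/river)
#   SYN  different word, same meaning   ("bank"/money vs. "lender")
#   CTRL different word, different meaning (baseline)
conditions = {"ID": [], "PS": [], "SYN": [], "CTRL": []}

rng = np.random.default_rng(0)
d = 768  # hidden size (illustrative)
for _ in range(100):  # one synthetic activation pair per sentence pair
    for name in conditions:
        u, v = rng.normal(size=d), rng.normal(size=d)
        conditions[name].append(cosine(u, v))

# With real activations, the paper's key comparison is PS vs. SYN:
# overlap driven by word form vs. overlap driven by shared meaning.
for name, sims in conditions.items():
    print(f"{name}: mean cosine = {np.mean(sims):.3f}")
```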


Paper Structure

This paper contains 68 sections, 3 equations, 11 figures, and 11 tables.

Figures (11)

  • Figure 1: Overview. Standard metrics (top) see that neurons $n_1$--$n_5$ fire for both senses of bank and call all of it superposition. Our decomposition (bottom) shows that most shared neurons are sense-blind---they encode the word form, not compressed concepts. Only a small remainder is genuine superposition.
  • Figure 2: $2 \times 2$ factorial decomposition design.
  • Figure 3: Top: Layer-averaged cosine similarity by condition across nine models (110M--70B). PS consistently exceeds SYN: word form drives more overlap than shared meaning. Bottom: Per-layer breakdown showing the pattern holds across layers. $R_\text{lex}$ trends in Appendix \ref{app:results}.
  • Figure 4: SAE collision analysis (GPT-2). Left: mean features per word by sense discriminability. Right: collision ratio across layers. 18--32% of features conflate senses of the same word (18--36% including Pythia-410M; Section \ref{sec:results-sae}). A sketch of the collision count follows this list.
  • Figure 5: Sense-selective ROME editing. Standard ROME modifies all activated neurons, disturbing both senses. Our approach classifies neurons first, shields sense-blind (lexical-form) neurons from modification, and edits only sense-selective ones---preserving the unedited sense. See the classification sketch after this list.
  • ...and 6 more figures
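
The collision ratio from Figure 4 can be sketched as follows; the activation matrix is synthetic and the firing threshold is an assumption, not the paper's criterion:

```python
# Sketch of the SAE feature-collision count from Figure 4 (illustrative:
# a feature "collides" if it fires above threshold for more than one
# sense of the same word).
import numpy as np

rng = np.random.default_rng(1)
n_features, n_senses = 1000, 2
# Mean SAE feature activation per (feature, sense) for one polysemous word.
acts = rng.gamma(shape=0.5, scale=1.0, size=(n_features, n_senses))

threshold = 1.0
fires = acts > threshold            # boolean: feature fires for a sense
active = fires.any(axis=1)          # feature fires for at least one sense
collides = fires.sum(axis=1) > 1    # feature fires for more than one sense

# Fraction of active features that conflate senses of the same word.
collision_ratio = collides.sum() / max(active.sum(), 1)
print(f"collision ratio: {collision_ratio:.2%}")
```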
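Similarly, the neuron-classification step described in Figure 5 might look like this sketch, assuming a hypothetical effect-size criterion for sense selectivity (the paper's actual classifier may differ):

```python
# Sketch of the neuron-classification step behind Figure 5 (hypothetical
# criterion: a neuron is sense-selective if its mean activation differs
# between senses by a large effect size; sense-blind neurons are shielded).
import numpy as np

rng = np.random.default_rng(2)
n_neurons = 512
# Synthetic activations of one layer on contexts for each sense of a word.
sense_a = rng.normal(1.0, 0.3, size=(50, n_neurons))
sense_b = rng.normal(1.0, 0.3, size=(50, n_neurons))

# Per-neuron effect size of the sense difference (Cohen's d style).
diff = sense_a.mean(axis=0) - sense_b.mean(axis=0)
pooled = np.sqrt((sense_a.var(axis=0) + sense_b.var(axis=0)) / 2) + 1e-8
selectivity = np.abs(diff) / pooled

sense_selective = selectivity > 0.5  # candidates for the ROME edit
sense_blind = ~sense_selective       # encode word form; leave untouched
print(f"editable neurons: {int(sense_selective.sum())} / {n_neurons}")
```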