Language Arithmetics: Towards Systematic Language Neuron Identification and Manipulation

Daniil Gurgurov, Katharina Trinley, Yusser Al Ghussin, Tanja Baeumel, Josef van Genabith, Simon Ostermann

TL;DR

The paper investigates language-specific neurons in multilingual LLMs to understand internal language representations and enable controllable language use. It uses Language Activation Probability Entropy (LAPE) to identify language-specific neurons and presents a minimally invasive intervention based on activation addition and multiplication, termed language arithmetics, to steer outputs toward a target language without fine-tuning. Empirically, language-specific neurons concentrate in mid-to-late layers, with non-Latin scripts showing stronger specialization and limited cross-language overlap, while typologically related languages share neurons. Across five multilingual tasks (language forcing, translation, QA, comprehension, and NLI), these interventions outperform activation-replacement approaches and reveal internal fallback mechanisms when language neurons are progressively deactivated. The study demonstrates an interpretable method for neuron-level multilingual control, with practical implications for cross-language performance and controllability in LLMs, and outlines avenues for broader evaluation and model-scale exploration.
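To make the LAPE idea concrete, here is a minimal sketch in plain Python. It assumes LAPE is computed as commonly described: for each neuron, take the probability that it activates on text in each language, normalize those probabilities across languages, and compute the entropy of the result, so that low entropy marks a neuron that fires mostly for one language. The function names, the toy numbers, and the `top_fraction` parameter are illustrative assumptions, not taken from the paper's code, and the sketch omits any additional filtering the actual method may apply.

```python
import math


def lape_scores(act_prob):
    """Return one entropy score per neuron.

    act_prob[n][l] is assumed to be the fraction of tokens in language l
    on which neuron n is active (hypothetical input format).
    """
    scores = []
    for probs in act_prob:
        total = sum(probs)
        if total == 0:
            # A neuron that never fires carries no language signal.
            scores.append(float("inf"))
            continue
        # Normalize activation probabilities into a distribution over languages.
        dist = [p / total for p in probs]
        # Shannon entropy; zero-probability terms contribute nothing.
        scores.append(-sum(p * math.log(p) for p in dist if p > 0))
    return scores


def select_language_neurons(act_prob, top_fraction=0.01):
    """Return indices of the lowest-entropy neurons (most language-specific)."""
    scores = lape_scores(act_prob)
    k = max(1, int(len(scores) * top_fraction))
    ranked = sorted(range(len(scores)), key=lambda n: scores[n])
    return ranked[:k]


# Toy example: 3 neurons, 3 languages.
toy = [
    [0.90, 0.01, 0.01],  # fires almost only for language 0
    [0.50, 0.50, 0.50],  # fires equally for all languages
    [0.00, 0.00, 0.00],  # never fires
]
selected = select_language_neurons(toy, top_fraction=0.34)
```

In this toy input, the first neuron has the lowest entropy and is the only one selected, while the uniformly active neuron reaches the maximum entropy of log 3.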

Abstract

Large language models (LLMs) exhibit strong multilingual abilities, yet the neural mechanisms behind language-specific processing remain unclear. We analyze language-specific neurons in Llama-3.1-8B, Mistral-Nemo-12B, and Aya-Expanse-8B & 32B across 21 typologically diverse languages, identifying neurons that control language behavior. Using the Language Activation Probability Entropy (LAPE) method, we show that these neurons cluster in deeper layers, with non-Latin scripts showing greater specialization. Related languages share overlapping neurons, reflecting internal representations of linguistic proximity. Through language arithmetics, i.e. systematic activation addition and multiplication, we steer models to deactivate unwanted languages and activate desired ones, outperforming simpler replacement approaches. These interventions effectively guide behavior across five multilingual tasks: language forcing, translation, QA, comprehension, and NLI. Manipulation is more successful for high-resource languages, while typological similarity improves effectiveness. We also demonstrate that cross-lingual neuron steering enhances downstream performance and reveal internal "fallback" mechanisms for language selection when neurons are progressively deactivated. Our code is made publicly available at https://github.com/d-gurgurov/Language-Neurons-Manipulation.
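The intervention described above can be sketched as follows, assuming it acts on a single layer's activation vector: neurons of the unwanted language are scaled (multiplied by zero to switch them off), and a per-neuron value is added to the neurons of the desired language. A replacement variant is included for contrast. The function names, the `boost` dictionary (for instance, a typical activation of each neuron on target-language text), and the toy values are hypothetical and are not drawn from the authors' repository.

```python
def steer_additive(activations, deactivate_idx, activate_idx, boost, scale=0.0):
    """Language-arithmetics style steering on one activation vector.

    deactivate_idx: neurons of the language to suppress (multiplied by `scale`).
    activate_idx:   neurons of the target language (a value is added).
    boost:          hypothetical dict mapping neuron index -> value to add.
    """
    out = list(activations)
    for i in deactivate_idx:
        out[i] *= scale          # multiplication: scale=0.0 deactivates
    for i in activate_idx:
        out[i] += boost[i]       # addition: keeps the existing signal
    return out


def steer_replace(activations, deactivate_idx, activate_idx, boost):
    """Replacement baseline: overwrite target neurons instead of adding."""
    out = list(activations)
    for i in deactivate_idx:
        out[i] = 0.0
    for i in activate_idx:
        out[i] = boost[i]        # discards whatever the neuron computed
    return out


# Toy example with four neurons.
acts = [1.0, 2.0, -0.5, 0.25]
boost = {2: 1.5, 3: 0.75}
added = steer_additive(acts, deactivate_idx=[0], activate_idx=[2, 3], boost=boost)
replaced = steer_replace(acts, deactivate_idx=[0], activate_idx=[2, 3], boost=boost)
```

The additive variant preserves the context-dependent part of each target neuron's activation, whereas replacement overwrites it with a fixed value. In a real model, such a function would typically run inside a forward hook on the relevant layers.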

Paper Structure

This paper contains 27 sections, 3 equations, 28 figures, 8 tables.

Figures (28)

  • Figure 1: Success rates of language forcing when deactivating neurons for the input language and activating those of a target language for Llama-3.1-8B. The input question is presented in the language corresponding to the deactivated neurons. Top 5% of neurons are considered.
  • Figure 2: Neuron overlap between languages and language families in Llama-3.1-8B, based on the top 1% of neurons identified as language-specific. Diagonals show counts per language; off-diagonals show overlaps. Asterisks mark non-Latin script languages.
  • Figure 3: Layer-wise distribution of language-specific neurons for individual languages in Llama-3.1-8B. Other models exhibit similar patterns (see the appendix).
  • Figure 4: Predicted probabilities of target languages for Llama-3.1 using logit lens outputs from each layer and FastText for language identification. Results for other models are shown in the appendix.
  • Figure 5: FLORES performance changes over the baseline (measured by BLEU score) when activating language-specific neurons for Mistral-Nemo (5%).
  • ...and 23 more figures