Large Language Models Share Representations of Latent Grammatical Concepts Across Typologically Diverse Languages

Jannik Brinkmann, Chris Wendler, Christian Bartelt, Aaron Mueller

TL;DR

The paper investigates whether large language models learn shared, cross-lingual morphosyntactic representations. It uses gated sparse autoencoders to extract multilingual feature directions and applies attribution patching and probing classifiers to establish cross-language sharing and causal relevance. Through a machine translation case study, the authors demonstrate that steering these multilingual features can selectively alter outputs with limited side effects, suggesting robust cross-lingual abstractions even in English-dominated pretraining. The findings imply that LM internal representations may encode concepts as language-invariant abstractions, with practical implications for cross-lingual transfer, interpretability, and targeted model editing. Overall, the work provides causal evidence for multilingual, concept-level representations and their potential utility in multilingual NLP tasks.

Abstract

Human bilinguals often use similar brain regions to process multiple languages, depending on when they learned their second language and their proficiency. In large language models (LLMs), how are multiple languages learned and encoded? In this work, we explore the extent to which LLMs share representations of morphosyntactic concepts such as grammatical number, gender, and tense across languages. We train sparse autoencoders on Llama-3-8B and Aya-23-8B, and demonstrate that abstract grammatical concepts are often encoded in feature directions shared across many languages. We use causal interventions to verify the multilingual nature of these representations; specifically, we show that ablating only multilingual features decreases classifier performance to near-chance across languages. We then use these features to precisely modify model behavior in a machine translation task; this demonstrates both the generality and selectivity of these features' roles in the network. Our findings suggest that even models trained predominantly on English data can develop robust, cross-lingual abstractions of morphosyntactic concepts.
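
The ablation test described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: it assumes a linear probe that reads sparse-autoencoder feature activations directly and that ablation means setting a feature's activation to zero. The feature indices, probe weights, and examples are all invented for illustration.

```python
def ablate(activations, feature_ids):
    """Return a copy of the activations with the selected features zeroed."""
    return [0.0 if i in feature_ids else a for i, a in enumerate(activations)]


def probe_predict(activations, weights, bias):
    """Linear probe: predict True when the weighted sum is positive."""
    score = sum(w * a for w, a in zip(weights, activations)) + bias
    return score > 0.0


def accuracy(examples, weights, bias, ablated=frozenset()):
    """Fraction of examples the probe labels correctly after ablation."""
    correct = sum(
        probe_predict(ablate(x, ablated), weights, bias) == label
        for x, label in examples
    )
    return correct / len(examples)


# Hypothetical setup: feature 0 stands in for a multilingual "plural" feature,
# feature 1 for a language-specific feature the probe barely uses.
weights, bias = [1.0, 0.1], -0.5
examples = [
    ([1.0, 0.0], True),
    ([0.0, 1.0], False),
    ([1.0, 1.0], True),
    ([0.0, 0.0], False),
]

baseline = accuracy(examples, weights, bias)
without_multilingual = accuracy(examples, weights, bias, ablated={0})
without_monolingual = accuracy(examples, weights, bias, ablated={1})
```

In this toy setup, zeroing the hypothetical multilingual feature drops the probe to chance on the balanced examples, while zeroing the language-specific feature leaves accuracy unchanged. That is the qualitative pattern the abstract reports, reproduced here only on made-up numbers.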

Paper Structure (47 sections, 11 equations, 18 figures, 3 tables)

Figures (18)

  • Figure 1: Using sparse autoencoders, we find that language models share representations of grammatical concepts across languages. By intervening on these multilingual representations, we can change the model behavior given inputs in different languages. For example, we can make the model predict plural verbs in different languages by activating the same plural feature.
  • Figure 2: Proportion of features shared across languages (intersection over union) among the top 32 features for each morphosyntactic concept. A significant fraction of the morphosyntactic concept representations are shared across languages in both Llama-3-8B and Aya-23-8B.
  • Figure 3: Examples of the activation patterns of selected features in Llama-3-8B that correspond to cross-lingual representations of grammatical concepts. For example, we locate features that indicate the presence of plural nouns or features that indicate past tense across languages.
  • Figure 4: Performance of the probing classifiers before and after ablating features. Specifically, we test ablating (a) all monolingual features, (b) all multilingual features, and (c) the upper quartile of the most multilingual features. We find that the classifiers crucially rely on massively multilingual features to predict the presence of morphosyntactic features.
  • Figure 5: Efficacy in flipping the model behavior on our dataset when translating while intervening on a single multilingual feature per concept. For each concept, we translate a sentence containing some concept (e.g., present tense) from some source language to another language while intervening on a feature, and measure the number of times the model generates a translation containing the counterfactual concept value (e.g., past tense). In each setting, we intervene on a single feature and measure the success rate over 64 examples. Results are aggregated across translation directions; the paper's appendix gives separate results per direction.
  • ...and 13 more figures
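
The overlap measure named in the Figure 2 caption (intersection over union among the top 32 features per concept) can be sketched as follows. This is a minimal illustration, not the authors' code. The caption does not say whether the overlap is computed pairwise or over all languages at once, so pairwise averaging is an assumption here, as are the function names and the toy attribution scores.

```python
from itertools import combinations


def top_k_features(scores, k=32):
    """Return the set of the k feature indices with the highest scores."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def intersection_over_union(a, b):
    """Jaccard overlap of two feature sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mean_pairwise_iou(scores_by_language, k=32):
    """Average IoU of top-k feature sets over all language pairs (assumed)."""
    top = {lang: top_k_features(s, k) for lang, s in scores_by_language.items()}
    pairs = list(combinations(sorted(top), 2))
    return sum(intersection_over_union(top[x], top[y]) for x, y in pairs) / len(pairs)


# Toy scores for one concept: feature index -> importance, per language.
toy_scores = {
    "en": {0: 0.9, 1: 0.8, 2: 0.1, 3: 0.05},
    "de": {0: 0.7, 1: 0.2, 2: 0.6, 3: 0.01},
}
shared = mean_pairwise_iou(toy_scores, k=2)
```

With `k=2`, the English top set is `{0, 1}` and the German top set is `{0, 2}`, so they share one feature out of three distinct ones and the overlap is 1/3.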