How a Bilingual LM Becomes Bilingual: Tracing Internal Representations with Sparse Autoencoders

Tatsuro Inaba, Go Kamoda, Kentaro Inui, Masaru Isonuma, Yusuke Miyao, Yohei Oseki, Benjamin Heinzerling, Yu Takagi

TL;DR

This work analyzes how bilingual representations emerge during pretraining by applying TopK sparse autoencoders (TopK-SAEs) to decompose hidden states into English-specific, Japanese-specific, and bilingual components. Using $K=32$ active features and an SAE hidden dimension of $n=32{,}768$, the authors track representation formation across training steps, layers, and model sizes. They find that the two languages are learned independently early in training and that bilingual alignment forms later, primarily in the mid layers and more strongly in larger models. An intervention supports a causal reading: injecting bilingual representations from a fully trained model into a mid-training model yields notable performance gains, indicating that bilingual knowledge contributes to final performance beyond monolingual signals. The findings suggest training schedules that emphasize bilingual alignment in later stages and show that SAEs can serve both for analysis and for targeted representation manipulation in multilingual LLMs.
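To make the TopK-SAE setup concrete, the following is a minimal NumPy sketch of one forward pass. It assumes the common TopK parameterization (encode the bias-centered hidden state, keep the $K$ largest pre-activations, decode linearly). The names `topk_sae_forward`, `W_enc`, `W_dec`, and `b_pre` are illustrative, and the authors' exact parameterization and training loss may differ. The demo uses small dimensions; the paper's setting is $K=32$ and $n=32{,}768$.

```python
import numpy as np

def topk_sae_forward(x, W_enc, W_dec, b_pre, k=32):
    """Forward pass of a TopK sparse autoencoder for one hidden state x.

    Assumed shapes (illustrative): x (d,), W_enc (n, d), W_dec (d, n), b_pre (d,).
    Returns the sparse code z (n,) and the reconstruction x_hat (d,).
    """
    pre = W_enc @ (x - b_pre)                # pre-activations for all n features
    top = np.argpartition(pre, -k)[-k:]      # indices of the k largest pre-activations
    z = np.zeros_like(pre)
    z[top] = pre[top]                        # every other feature is set to zero
    x_hat = W_dec @ z + b_pre                # linear decoder
    return z, x_hat

# Small random demo (untrained weights, so the reconstruction is not meaningful).
rng = np.random.default_rng(0)
d, n, k = 64, 512, 32
x = rng.normal(size=d)
W_enc = rng.normal(size=(n, d)) / np.sqrt(d)
W_dec = rng.normal(size=(d, n)) / np.sqrt(n)
b_pre = np.zeros(d)
z, x_hat = topk_sae_forward(x, W_enc, W_dec, b_pre, k=k)
```

Each of the $n$ entries of `z` corresponds to one SAE feature, and at most $K$ of them are non-zero for any given token, which is what makes per-feature analysis tractable.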

Abstract

This study explores how bilingual language models develop complex internal representations. We employ sparse autoencoders to analyze internal representations of bilingual language models with a focus on the effects of training steps, layers, and model sizes. Our analysis shows that language models first learn languages separately, and then gradually form bilingual alignments, particularly in the mid layers. We also found that this bilingual tendency is stronger in larger models. Building on these findings, we demonstrate the critical role of bilingual representations in model performance by employing a novel method that integrates decomposed representations from a fully trained model into a mid-training model. Our results provide insights into how language models acquire bilingual capabilities.
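The decomposition and integration idea can be illustrated with a short NumPy sketch. It assumes that each SAE feature already carries a language label and that a hidden state's reconstruction is split by summing decoder contributions per label. The functions `decompose_by_group` and `swap_component`, the label strings, and the simple swap rule are hypothetical illustrations, not the authors' procedure, which this summary does not specify.

```python
import numpy as np

def decompose_by_group(z, W_dec, labels):
    """Split the SAE reconstruction W_dec @ z into one component per feature group.

    z: sparse code (n,), W_dec: decoder (d, n), labels: array of n group names.
    The components sum to W_dec @ z (the decoder bias is not included).
    """
    return {g: W_dec @ np.where(labels == g, z, 0.0) for g in np.unique(labels)}

def swap_component(h_mid, comp_mid, comp_full, group="bilingual"):
    """Hypothetical integration rule: replace one group's component of a
    mid-training hidden state with the same group's component from a
    fully trained model."""
    return h_mid - comp_mid[group] + comp_full[group]

# Toy example with random values and made-up feature labels.
rng = np.random.default_rng(0)
d, n = 8, 30
labels = np.array(["english"] * 10 + ["japanese"] * 10 + ["bilingual"] * 10)
W_dec = rng.normal(size=(d, n))
z_mid, z_full = rng.random(n), rng.random(n)
comp_mid = decompose_by_group(z_mid, W_dec, labels)
comp_full = decompose_by_group(z_full, W_dec, labels)
h_mid = rng.normal(size=d)
h_patched = swap_component(h_mid, comp_mid, comp_full)
```

The toy example shares one decoder between both codes for brevity. In the paper, SAEs are trained independently for each checkpoint, so a faithful implementation would need to handle two separate feature dictionaries.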

Paper Structure

This paper contains 28 sections, 5 equations, 17 figures, 4 tables.

Figures (17)

  • Figure 1: Illustration of the experimental setup (top) and the key findings (bottom). In the top panel, SAEs are trained independently on language models at each training stage, layer, and model size. The bottom panel visualizes the evolution of bilingual alignment, derived from comparisons of the features learned by each SAE.
  • Figure 2: The procedure for calculating Monosemanticity ($R_{\mathrm{mono}}(i)$) from Token Entropy ($H_{\mathrm{token}}(i)$) and Semantic Entropy ($H_{\mathrm{semantic}}(i)$) for the $i$-th feature.
  • Figure 3: (a) Language Distribution and (b) Semantic Distribution of SAE features at the 14th layer of the 3.7B model across training stages. During early training ($\leq 4 \times 10^8$ tokens), the model exhibits a high proportion of mixed-language features and low monosemanticity, indicating that features are activated by tokens from both languages without clear semantic coherence. As training continues ($4 \times 10^8$ -- $4 \times 10^9$ tokens), the mixed-language proportion decreases while monosemanticity increases, reflecting more language-specific and semantically coherent features. In the late training stage ($\geq 4 \times 10^9$ tokens), the mixed-language proportion rises again while high monosemanticity is maintained, suggesting the emergence of bilingual semantic representations.
  • Figure 4: Activation patterns of features at the 14th layer of the 3.7B model across training stages. (a) In the early training stage ($4 \times 10^6$ tokens), features are activated by random tokens without any clear semantic structure. (b) In the mid-training stage ($4 \times 10^9$ tokens), features become more language-specific, each activating on semantically similar words in a single language. (c) In the fully trained model ($2 \times 10^{12}$ tokens), features exhibit bilingual activation, with semantically related tokens appearing in both Japanese and English.
  • Figure 5: Layer-wise evolution of the mixed-language proportion and monosemanticity in the 3.7B model across training stages.
  • ...and 12 more figures
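
The captions above refer to per-feature quantities such as Token Entropy and the share of mixed-language features, but this summary does not give their formulas. The sketch below shows one plausible reading using only the Python standard library: Shannon entropy over the tokens that activate a feature, and a language label based on the share of English activations. The function names, the `"en"`/`"ja"` tags, and the 0.9 threshold are assumptions for illustration and may not match the paper's definitions.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Shannon entropy (in nats) of the empirical distribution of tokens
    that activate a feature. Low entropy means the feature fires on few
    distinct tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def language_label(langs, threshold=0.9):
    """Label a feature from the language tags ("en" or "ja") of its
    activating tokens. The threshold is a hypothetical choice."""
    share_en = langs.count("en") / len(langs)
    if share_en >= threshold:
        return "english"
    if share_en <= 1 - threshold:
        return "japanese"
    return "mixed"

# A feature that fires on the same concept in both languages.
activating_tokens = ["dog", "dog", "犬", "犬", "dogs"]
activating_langs = ["en", "en", "ja", "ja", "en"]
h_token = token_entropy(activating_tokens)
label = language_label(activating_langs)
```

Under this reading, a bilingual feature like the one in the example gets a "mixed" label while remaining semantically coherent, which is the pattern the Figure 3 caption describes for late training.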