Causal Language Control in Multilingual Transformers via Sparse Feature Steering

Cheng-Ting Chou, George Liu, Jessica Sun, Cole Blondin, Kevin Zhu, Vasu Sharma, Sean O'Brien

TL;DR

This work tackles deterministic language control in zero-shot multilingual transformers by introducing sparse autoencoder (SAE) feature steering. By identifying language-specific SAE features and intervening on them during inference, the method shifts generated text into target languages (Chinese, Japanese, Spanish, French) while preserving semantic content, achieving up to ~90% success in some cases. Layerwise analysis reveals that mid-to-late transformer layers and certain attention heads amplify language-directed signals, offering a mechanistic view of how steerable representations emerge. The approach provides a lightweight, interpretable alternative to prompts or retraining for multilingual generation control, with implications for language-specific generation, alignment, and safety in large multilingual models.

Abstract

Deterministically controlling the target generation language of large multilingual language models (LLMs) remains a fundamental challenge, particularly in zero-shot settings where neither explicit language prompts nor fine-tuning are available. In this work, we investigate whether sparse autoencoder (SAE) features, previously shown to correlate with interpretable model behaviors, can be leveraged to steer the generated language of LLMs during inference. Leveraging pretrained SAEs on the residual streams of Gemma-2B and Gemma-9B, we identify features whose activations differ most significantly between English and four target languages: Chinese, Japanese, Spanish, and French. By modifying just a single SAE feature at one transformer layer, we achieve controlled language shifts with up to 90% success, as measured by FastText language classification, while preserving semantic fidelity according to LaBSE (Language-Agnostic BERT Sentence Embedding) similarity. Our analysis reveals that language steering is most effective in mid-to-late transformer layers and is amplified by specific attention heads disproportionately associated with language-sensitive SAE features. These results demonstrate the promise of sparse feature steering as a lightweight and interpretable mechanism for controllable multilingual generation.
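
The two steps the abstract describes, picking a language-sensitive SAE feature and intervening on it at a single layer, can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the selection criterion is the largest gain in mean last-token activation from English to target-language inputs, and that steering adds a scaled SAE decoder direction to the residual stream. The function names, the `strength` parameter, and all toy values are hypothetical.

```python
from statistics import mean


def select_language_feature(english_acts, target_acts):
    """Return the index of the feature whose mean activation rises most
    from English inputs to target-language inputs.

    Each argument is a list of per-input activation vectors (assumed here
    to be SAE activations at the last token). The mean-difference
    criterion is an assumption, not the paper's stated statistic.
    """
    n_features = len(english_acts[0])
    diffs = [
        mean(row[j] for row in target_acts) - mean(row[j] for row in english_acts)
        for j in range(n_features)
    ]
    return max(range(n_features), key=diffs.__getitem__)


def steer_residual(residual, decoder_direction, strength):
    """Add a scaled decoder direction to one residual-stream vector.

    `strength` is a hypothetical steering coefficient.
    """
    return [r + strength * d for r, d in zip(residual, decoder_direction)]


# Toy activations: 2 inputs per language, 3 SAE features (invented values).
english_acts = [[0.1, 0.0, 0.3], [0.2, 0.1, 0.2]]
target_acts = [[0.1, 2.0, 0.3], [0.0, 1.8, 0.4]]
feature = select_language_feature(english_acts, target_acts)

# Toy decoder: one 2-dimensional residual direction per feature.
decoder = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
steered = steer_residual([0.2, -0.1], decoder[feature], strength=4.0)
```

In a real setup the intervention would run inside a forward hook at the chosen layer, and the outputs would then be scored with FastText and LaBSE as the abstract describes.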

Paper Structure

This paper contains 36 sections, 4 equations, 35 figures, 2 tables.

Figures (35)

  • Figure 1: LaBSE semantic similarity scores for steered outputs across Gemma-2-9B layers, using last-token-selected language features for Chinese (CMN), Spanish (SPA), Japanese (JPN), and French (FRA). The results show that steering effectiveness varies across layers, with peak semantic alignment occurring at mid to late layers for different languages.
  • Figure 2: FastText classification probabilities of the same steered outputs, revealing layer-specific differences in how strongly outputs reflect the target language. Later layers generally show higher classification confidence, indicating greater controllability through steering at those depths.
  • Figure 3: Top 3 contributing attention heads at Layer 29 across all Input → Feature language pairs. Each subplot shows the three attention heads with the highest contribution to the language-specific SAE feature when the model is given input in a different language. Head 12 is highlighted in red when it appears. The strong, selective dominance of Head 12 in all on-diagonal cases (e.g., cmn → cmn, jpn → jpn, fra → fra) but not off-diagonal cases suggests it plays a role in language-specific representation rather than general-purpose amplification.
  • Figure 4: Layer 23
  • Figure 5: Layer 29
  • ...and 30 more figures
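
The head ranking in Figure 3 can be pictured with a small sketch. The listing above does not say how head contributions are computed, so this assumes a simple direct attribution: each head's output, as written into the residual stream, is projected onto the feature's direction, and the heads are sorted by that score. The function name `top_heads`, the head indices, and every number are hypothetical.

```python
def top_heads(head_outputs, feature_direction, k=3):
    """Rank attention heads by the dot product of their residual-stream
    output with a feature direction and return the top-k head indices.

    This projection-based score is an assumed attribution method.
    """
    scores = {
        head: sum(o * w for o, w in zip(out, feature_direction))
        for head, out in head_outputs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy per-head outputs in a 3-dimensional residual stream (invented values).
head_outputs = {
    0: [0.1, 0.0, 0.2],
    1: [0.0, 0.3, 0.1],
    2: [0.9, 0.1, 0.0],
    3: [-0.2, 0.0, 0.1],
}
feature_direction = [1.0, 0.5, 0.0]

ranking = top_heads(head_outputs, feature_direction)
```

Repeating the ranking for each input-language and feature-language pair would give a grid of top-3 lists comparable in shape to the subplots in Figure 3.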