Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads

Shaswat Patel, Vishvesh Trivedi, Yue Han, Yihuai Hong, Eunsol Choi

TL;DR

This work identifies Retrieval-Transition Heads (RTH), which govern the transition to specific target-language output in the cross-lingual setting, and advances the understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.

Abstract

Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Expanding the study to the cross-lingual setting, we identify Retrieval-Transition Heads (RTH), which govern the transition to specific target-language output. Our experiments reveal that RTHs are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuAD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTHs induces a larger performance drop than masking Retrieval Heads (RH). Our work advances the understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.

Paper Structure (32 sections, 6 equations, 20 figures, 7 tables)

Figures (20)

  • Figure 1: Multilingual NIAH dataset construction pipeline. We sample needles, questions and haystacks, translate the needle and question into the target language, segment the haystack, and inject the translated needle to produce the final evaluation instances.
  • Figure 2: Language-specific (red) and shared (orange, green, blue) retrieval heads in three multilingual LMs for four languages: English (en), Chinese (zh), German (de), and Swahili (sw). The size of each block is proportional to the number of retrieval heads that are language-exclusive (red), bilingual (orange), trilingual (green), or shared across all four languages (blue). Qwen-2.5 7B exhibits strong multilingual head overlap (59% 4L), Llama-3.1 8B has high bilingual (mainly en-zh) head overlap (24.6% 2L), and Phi-3.5 3B shows measurably more language-specific (mainly en-specific) behavior (28.4% 1L) than the other two model families.
  • Figure 3: Illustration of the retrieval-transition score (RTS). Needle tokens in the source language are first aligned to their target-language equivalents using an LM. The degree of overlap between the model’s generated tokens and these aligned target needles quantifies each attention head’s contribution to cross-lingual retrieval.
  • Figure 4: Layer–head distributions of Retrieval Heads (RH) and Retrieval-Transition Heads (RTH) across languages for Qwen-2.5 7B Instruct. Colored cells belong to the top-50 most prominent heads ranked by score. Among these colored heads, language-specific RH dominate the final layers, while RTH are prominent in the middle layers. Like RH, RTH are sparse: only 3-8% of heads have an RTS above 0.1.
  • Figure 5: Spearman correlation between the top-50 most influential $RH(\ell)$ and $RTH(\ell)$ heads in Qwen-2.5 7B. $RTH(\ell)$ heads correlate weakly with $RH(\ell)$ heads but strongly with other $RTH(\ell_0)$ heads, and vice versa.
  • ...and 15 more figures
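
The head-sharing breakdown described in the Figure 2 caption (1L, 2L, 3L, 4L) can be illustrated with a small tally. This is only a sketch under stated assumptions: it assumes each language's retrieval heads are available as a set of (layer, head) pairs, and that a head's category is the number of languages in which it appears. The function name `sharing_breakdown` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter


def sharing_breakdown(heads_by_lang):
    """Map k -> fraction of heads that are retrieval heads in exactly k languages.

    heads_by_lang: dict of language code -> iterable of (layer, head) pairs.
    This input format is an assumption made for illustration.
    """
    langs_per_head = Counter()
    for heads in heads_by_lang.values():
        # set() ensures a language is counted at most once per head
        langs_per_head.update(set(heads))

    total = len(langs_per_head)
    if total == 0:
        return {}

    heads_per_k = Counter(langs_per_head.values())
    return {
        k: heads_per_k.get(k, 0) / total
        for k in range(1, len(heads_by_lang) + 1)
    }


# Hypothetical toy data: (layer, head) indices invented for this example.
toy = {
    "en": {(0, 1), (5, 3), (9, 2), (12, 7)},
    "zh": {(5, 3), (9, 2), (12, 7)},
    "de": {(9, 2), (12, 7)},
    "sw": {(12, 7), (20, 4)},
}
breakdown = sharing_breakdown(toy)
for k, frac in breakdown.items():
    print(f"{k}L: {frac:.1%}")
```

In this toy input, two of the five distinct heads appear in a single language, and one head each appears in two, three, and four languages. Whether the paper normalizes by the union of heads, as done here, is not stated in the caption, so the denominator should be treated as an assumption.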