Bridging Latent Reasoning and Target-Language Generation via Retrieval-Transition Heads

Shaswat Patel; Vishvesh Trivedi; Yue Han; Yihuai Hong; Eunsol Choi

TL;DR

This work identifies Retrieval-Transition Heads (RTH), attention heads that govern the transition to target-language output in cross-lingual settings, and advances the understanding of multilingual LMs by isolating the heads responsible for mapping to target languages.

Abstract

Recent work has identified a subset of attention heads in Transformers as retrieval heads, which are responsible for retrieving information from the context. In this work, we first investigate retrieval heads in multilingual contexts. In multilingual language models, we find that retrieval heads are often shared across multiple languages. Extending the study to the cross-lingual setting, we identify Retrieval-Transition Heads (RTH), which govern the transition to target-language output. Our experiments reveal that RTH are distinct from retrieval heads and more vital for Chain-of-Thought reasoning in multilingual LLMs. Across four multilingual benchmarks (MMLU-ProX, MGSM, MLQA, and XQuAD) and two model families (Qwen-2.5 and Llama-3.1), we demonstrate that masking RTH induces a larger performance drop than masking Retrieval Heads (RH). Our work advances the understanding of multilingual LMs by isolating the attention heads responsible for mapping to target languages.
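The masking comparison described above can be illustrated with a minimal sketch. It assumes each head already has a score keyed by (layer, head), and that masking is expressed as a per-layer, per-head 0/1 matrix. The function names `top_k_heads` and `build_head_mask` and the toy scores are hypothetical and are not taken from the paper.

```python
def top_k_heads(scores, k):
    """Return the k (layer, head) pairs with the highest score."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [head for head, _ in ranked[:k]]


def build_head_mask(num_layers, num_heads, heads_to_mask):
    """Build a num_layers x num_heads matrix: 1.0 keeps a head, 0.0 masks it."""
    mask = [[1.0] * num_heads for _ in range(num_layers)]
    for layer, head in heads_to_mask:
        mask[layer][head] = 0.0
    return mask


# Hypothetical toy scores for a 2-layer, 2-head model (illustrative values only).
rts_scores = {(0, 0): 0.02, (0, 1): 0.41, (1, 0): 0.35, (1, 1): 0.05}

masked = top_k_heads(rts_scores, k=2)
head_mask = build_head_mask(num_layers=2, num_heads=2, heads_to_mask=masked)
```

Under these assumptions, the same routine would be run once with RTH scores and once with RH scores, and the two resulting masks applied to the model before evaluating on each benchmark.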

Paper Structure (32 sections, 6 equations, 20 figures, 7 tables)

Introduction
Related Work
Understanding Multilingual Language Models
Attention heads in multilingual LLMs
Retrieval Heads in Diverse Languages
Background
Multilingual Experimental Setting
Analyzing distribution of retrieval heads across different languages
Retrieval-Transition heads
Experimental Setup
Evaluation of Retrieval-Transition Heads
The distribution of retrieval-transition heads
The role of retrieval-transition heads in multilingual reasoning performance
Qualitative Analysis
Conclusion and Future Work
...and 17 more sections

Figures (20)

Figure 1: Multilingual NIAH dataset construction pipeline. We sample needles, questions and haystacks, translate the needle and question into the target language, segment the haystack, and inject the translated needle to produce the final evaluation instances.
Figure 2: Language-specific (red) and shared (orange, green, blue) retrieval heads in three multilingual LMs for four languages: English (en), Chinese (zh), German (de), and Swahili (sw). The size of each block is proportional to the number of retrieval heads that are language-exclusive (red), bilingual (orange), trilingual (green), or shared by all four languages (blue). Qwen-2.5 7B exhibits strong multilingual head overlap (59% 4L), Llama-3.1 8B has high bilingual (mainly en-zh) head overlap (24.6% 2L), and Phi-3.5 3B shows measurably more language-specific (mainly en-specific) behavior (28.4% 1L) than the other two model families.
Figure 3: Illustration of the retrieval-transition score (RTS). Needle tokens in the source language are first aligned to their target-language equivalents using an LM. The degree of overlap between the model’s generated tokens and these aligned target needles quantifies each attention head’s contribution to cross-lingual retrieval.
Figure 4: Layer–head distributions of Retrieval Heads (RH) and Retrieval-Transition Heads (RTH) across languages for Qwen-2.5 7B Instruct. Colored cells belong to the top-50 heads ranked by score. Among these colored heads, language-specific RH dominate in the final layers, while RTH are prominent in the middle layers. Like RH, RTH are sparse: only 3-8% of heads have an RTS above 0.1.
Figure 5: Spearman correlation between the top-50 most influential $RH(\ell)$ and $RTH(\ell)$ in Qwen-2.5 7B. $RTH(\ell)$ heads correlate weakly with $RH(\ell)$ but strongly with other $RTH(\ell_0)$, and vice versa.
...and 15 more figures
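The language-overlap breakdown in the Figure 2 caption (1L, 2L, 3L, 4L) can be sketched as follows. This is a minimal illustration that assumes each language comes with a set of (layer, head) pairs already identified as retrieval heads, and that percentages are taken over the distinct heads found in at least one language. The function name `overlap_breakdown` and the toy head sets are hypothetical.

```python
from collections import Counter


def overlap_breakdown(heads_by_language):
    """Map n -> fraction of distinct heads that appear in exactly n languages."""
    language_count = Counter()
    for heads in heads_by_language.values():
        for head in set(heads):
            language_count[head] += 1
    total = len(language_count)
    if total == 0:
        return {}
    bucket_sizes = Counter(language_count.values())
    return {n: bucket_sizes[n] / total
            for n in range(1, len(heads_by_language) + 1)}


# Hypothetical toy head sets (illustrative only, not results from the paper).
heads_by_language = {
    "en": {(20, 3), (21, 7), (5, 1)},
    "zh": {(20, 3), (21, 7)},
    "de": {(20, 3)},
    "sw": {(20, 3), (9, 9)},
}

breakdown = overlap_breakdown(heads_by_language)
# Key 1 is the language-exclusive share; key 4 is the share found in all four languages.
```

The choice of denominator is an assumption here; the caption only states that block sizes are proportional to the number of heads in each category.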
