Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models

Amey Hengle; Prasoon Bajpai; Soham Dan; Tanmoy Chakraborty

TL;DR

The paper evaluates the long-context information retrieval capabilities of multilingual LLMs. It introduces MLNeedle, a benchmark that combines MLQA in seven languages with multilingual mMARCO distractors to test retrieval when the needle's language and position vary within a long context of $4K$ to $32K$ tokens. By testing four open-source models and comparing exact and existence accuracy, the study finds a pronounced sensitivity to needle language and placement and limited cross-lingual retrieval success as the context grows, while the distractor language has a smaller effect. These findings offer concrete guidance for designing robust multilingual long-context evaluation protocols and point to avenues for improving cross-lingual information retrieval over extended sequences.

Abstract

While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their ability to handle long multilingual contexts is unexplored. As such, a systematic evaluation of the long-context capabilities of LLMs in multilingual settings is crucial, specifically in the context of information retrieval. To address this gap, we introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test, designed to assess a model's ability to retrieve relevant information (the needle) from a collection of multilingual distractor texts (the haystack). This test serves as an extension of the multilingual question-answering task, encompassing both monolingual and cross-lingual retrieval. We evaluate four state-of-the-art LLMs on MLNeedle. Our findings reveal that model performance can vary significantly with language and needle position. Specifically, we observe that model performance is the lowest when the needle is (i) in a language outside the English language family and (ii) located in the middle of the input context. Furthermore, although some models claim a context size of $8k$ tokens or greater, none demonstrate satisfactory cross-lingual retrieval performance as the context length increases. Our analysis provides key insights into the long-context behavior of LLMs in multilingual settings to guide future evaluation protocols. To our knowledge, this is the first study to investigate the multilingual long-context behavior of LLMs.
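
The needle-in-a-haystack setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`count_tokens`, `fill_haystack`, `place_needle`), the whitespace-based token count, and the fractional `position` parameter are all assumptions made for illustration. It only shows the general idea of filling a token budget with distractor passages and then inserting the answer document at a chosen relative depth.

```python
from typing import List


def count_tokens(text: str) -> int:
    # Assumption: whitespace splitting stands in for a real model tokenizer.
    return len(text.split())


def fill_haystack(distractors: List[str], needle: str, budget: int) -> List[str]:
    """Keep distractor passages, in order, until the token budget is reached.

    The needle's own tokens are reserved first so it always fits.
    """
    used = count_tokens(needle)
    kept: List[str] = []
    for passage in distractors:
        cost = count_tokens(passage)
        if used + cost > budget:
            break
        kept.append(passage)
        used += cost
    return kept


def place_needle(haystack: List[str], needle: str, position: float) -> List[str]:
    """Insert the needle at a relative depth: 0.0 = start, 1.0 = end."""
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must lie in [0, 1]")
    index = int(position * len(haystack))
    return haystack[:index] + [needle] + haystack[index:]


if __name__ == "__main__":
    # Hypothetical toy passages; a real run would draw these from the
    # benchmark's question-answering and distractor sources.
    distractors = ["d one", "d two", "d three", "d four", "d five"]
    needle = "the answer passage"
    haystack = fill_haystack(distractors, needle, budget=11)
    for depth in (0.0, 0.5, 1.0):
        print(depth, place_needle(haystack, needle, depth))
```

Sweeping `position` over start, middle, and end, and swapping the language of `needle` or of the distractors independently, corresponds to the three factors the study varies (needle position, needle language, haystack language).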

Paper Structure (25 sections, 1 equation, 10 figures, 6 tables)

Introduction
MultiLingual Needle in a Haystack
Experimental Setup
The MLNeedle Dataset
Constructing the Haystack ($H$).
Positioning the Needle ($N$).
Models
Evaluation Metric
Experimental Results
Effect of Changing the Needle Position
Effect of Changing the Needle Language
Effect of Changing the Haystack Language
Ablation Studies
Related Work
Multilingual Question Answering and Information Retrieval.
...and 10 more sections

Figures (10)

Figure 1: Monolingual long-context performance (accuracy on the radial axis) for various LLMs, averaged across context sizes (4K, 8K, 16K, and 32K). We observe a considerable drop in performance for all languages except English, suggesting that multilingual LLMs struggle to process non-English (or non-Latin) long input contexts.
Figure 2: Example of a multilingual question-answering input from the MLNeedle dataset. (Top) There are no distractor documents and the same needle (highlighted in green) is present in English (Left) and Hindi (Right); (Bottom) There are distractor documents present (highlighted in red) and the same needle is present in English (Left) and Hindi (Right).
Figure 3: Effect of changing the language of answer document (needle).
Figure 4: Effect of changing position of answer document (needle).
Figure 5: Exact accuracy of models on varying sample sizes for evaluation. Solid lines denote the accuracy, and the shaded area denotes the standard error.
...and 5 more figures
