Multilingual Needle in a Haystack: Investigating Long-Context Behavior of Multilingual Large Language Models
Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty
TL;DR
The paper evaluates the long-context retrieval capabilities of multilingual LLMs. It introduces MLNeedle, a benchmark that combines MLQA in seven languages with multilingual mMARCO distractors to test retrieval as the needle's language and position vary within long contexts of $4K$ to $32K$ tokens. Testing four open-source models and analyzing both exact and existence accuracy, the study finds pronounced sensitivity to needle language and placement and limited cross-lingual retrieval success as context grows, while distractor language has a smaller effect. These findings offer concrete guidance for designing robust multilingual long-context evaluation protocols and point to avenues for improving cross-lingual information retrieval over extended sequences.
Abstract
While recent large language models (LLMs) demonstrate remarkable abilities in responding to queries in diverse languages, their ability to handle long multilingual contexts is unexplored. As such, a systematic evaluation of the long-context capabilities of LLMs in multilingual settings is crucial, specifically in the context of information retrieval. To address this gap, we introduce the MultiLingual Needle-in-a-Haystack (MLNeedle) test, designed to assess a model's ability to retrieve relevant information (the needle) from a collection of multilingual distractor texts (the haystack). This test serves as an extension of the multilingual question-answering task, encompassing both monolingual and cross-lingual retrieval. We evaluate four state-of-the-art LLMs on MLNeedle. Our findings reveal that model performance can vary significantly with language and needle position. Specifically, we observe that model performance is the lowest when the needle is (i) in a language outside the English language family and (ii) located in the middle of the input context. Furthermore, although some models claim a context size of $8K$ tokens or greater, none demonstrate satisfactory cross-lingual retrieval performance as the context length increases. Our analysis provides key insights into the long-context behavior of LLMs in multilingual settings to guide future evaluation protocols. To our knowledge, this is the first study to investigate the multilingual long-context behavior of LLMs.
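To make the needle-in-a-haystack setup concrete, the following is a minimal sketch of how one test context might be assembled: distractor passages fill a token budget, and the needle passage is inserted at a chosen relative depth. This is an illustration only, not the authors' implementation. The function name `build_haystack`, the `depth` parameter, and the whitespace-based token count are all assumptions made for this example; an actual evaluation would count tokens with the tokenizer of the model under test.

```python
from typing import List


def build_haystack(needle: str, distractors: List[str],
                   depth: float, budget: int) -> List[str]:
    """Place `needle` among `distractors` at a relative depth.

    depth  -- 0.0 puts the needle first, 1.0 puts it last (hypothetical knob).
    budget -- maximum total length in whitespace-separated tokens
              (an assumption; a real setup would use the model's tokenizer).
    """
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must lie in [0, 1]")

    def n_tokens(text: str) -> int:
        # Crude stand-in for a real tokenizer.
        return len(text.split())

    remaining = budget - n_tokens(needle)
    if remaining < 0:
        raise ValueError("budget is too small to hold the needle")

    # Keep distractors, in order, until the next one would exceed the budget.
    kept: List[str] = []
    for passage in distractors:
        cost = n_tokens(passage)
        if cost > remaining:
            break
        kept.append(passage)
        remaining -= cost

    # Convert the relative depth into an insertion index.
    index = int(depth * len(kept))
    return kept[:index] + [needle] + kept[index:]


if __name__ == "__main__":
    passages = ["a b c", "d e f", "g h i", "j k l"]
    for d in (0.0, 0.5, 1.0):
        print(d, build_haystack("N E", passages, depth=d, budget=11))
```

Sweeping `depth` and `budget`, and drawing the needle and distractors from different languages, would yield the kind of position, length, and language grid the abstract describes.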
