Evaluating Multilingual Long-Context Models for Retrieval and Reasoning
Ameeta Agrawal, Andy Dang, Sina Bagheri Nezhad, Rhitabrat Pokharel, Russell Scheinberg
TL;DR
This paper addresses the problem of evaluating multilingual long-context LLMs for retrieval and reasoning across five languages with varying resource levels. It introduces the mLongRR dataset, combining naturally occurring BBC articles and translated needles in a needle-in-a-haystack paradigm, to test retrieval and multi-needle reasoning. The authors evaluate six models across context windows from $2k$ to $64k$ tokens and analyze the impact of task complexity, language resource level, and tokenization on performance; results show substantial declines with longer contexts and more needles, and pronounced gaps between languages and models. The work highlights the need for improved multilingual long-context modeling and tokenization strategies to enable reliable retrieval and reasoning in diverse languages.
Abstract
Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, with some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We create a new dataset -- mLongRR -- to comprehensively evaluate several multilingual long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. With a single target sentence, the best-performing models, such as Gemini-1.5 and GPT-4o, achieve around 96% accuracy in English but only around 36% in Somali. With three target sentences, this accuracy drops to 40% in English and 0% in Somali. Our findings highlight the challenges long-context LLMs face as contexts grow longer, as the number of target sentences increases, and as the resource level of the language decreases.
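To make the needle-in-a-haystack setup concrete, the following is a minimal Python sketch of how target sentences might be hidden in a long context and how retrieval accuracy might be scored. It is an illustration only: the function names, the random placement strategy, and the substring-based scoring are assumptions, not the construction or metric actually used for mLongRR.

```python
import random


def build_haystack(sentences, needles, seed=0):
    """Hide each needle at a random sentence boundary of the background text.

    `sentences` is the background text split into sentences (hypothetical input);
    `needles` are the target sentences to hide. Placement is random here, which
    is an assumption made for illustration.
    """
    rng = random.Random(seed)
    haystack = list(sentences)
    for needle in needles:
        # randint is inclusive on both ends, so a needle may also land at the end.
        position = rng.randint(0, len(haystack))
        haystack.insert(position, needle)
    return " ".join(haystack)


def substring_accuracy(predictions, targets):
    """Fraction of predictions that contain the expected answer string.

    Case-insensitive substring matching is an assumed scoring rule.
    """
    if not targets:
        return 0.0
    hits = sum(t.lower() in p.lower() for p, t in zip(predictions, targets))
    return hits / len(targets)
```

In a multi-needle setting, `needles` would hold several target sentences, and a reasoning question (for example, comparing values stated in different needles) would be asked over the resulting context.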
