Evaluating Multilingual Long-Context Models for Retrieval and Reasoning

Ameeta Agrawal; Andy Dang; Sina Bagheri Nezhad; Rhitabrat Pokharel; Russell Scheinberg

TL;DR

This paper addresses the problem of evaluating multilingual long-context LLMs for retrieval and reasoning across five languages with varying resource levels. It introduces the mLongRR dataset, combining naturally occurring BBC articles and translated needles in a needle-in-a-haystack paradigm, to test retrieval and multi-needle reasoning. The authors evaluate six models across context windows from $2k$ to $64k$ tokens and analyze the impact of task complexity, language resource level, and tokenization on performance; results show substantial declines with longer contexts and more needles, and pronounced gaps between languages and models. The work highlights the need for improved multilingual long-context modeling and tokenization strategies to enable reliable retrieval and reasoning in diverse languages.

Abstract

Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We create a new dataset, mLongRR, to comprehensively evaluate several multilingual long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. With a single target sentence, the best-performing models, such as Gemini-1.5 and GPT-4o, range from around 96% accuracy in English to around 36% in Somali. With three target sentences, this accuracy drops to 40% in English and 0% in Somali. Our findings highlight the challenges long-context LLMs face when processing longer contexts, a larger number of target sentences, or languages of lower resource levels.
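As an illustration of the needle-in-a-haystack setup described above, the following is a minimal sketch of placing one or more target sentences at chosen depths in a long context. It is not the authors' construction code: the function name `insert_needles`, the use of sentences rather than tokens as the unit of length, and the fractional-depth convention (0.0 for the start, 1.0 for the end) are all assumptions made for this example.

```python
def insert_needles(haystack_sentences, needles, depths):
    """Return a new list with each needle inserted at a fractional depth.

    Hypothetical helper for illustration only. Depths are fractions of the
    original haystack length: 0.0 places a needle at the start, 1.0 at the end.
    """
    if len(needles) != len(depths):
        raise ValueError("needles and depths must have the same length")
    for depth in depths:
        if not 0.0 <= depth <= 1.0:
            raise ValueError("each depth must lie in [0.0, 1.0]")

    result = list(haystack_sentences)  # copy so the input is left unchanged
    n = len(haystack_sentences)

    # Insert the deepest needle first so that positions computed from the
    # original haystack length remain valid for the shallower needles.
    for depth, needle in sorted(zip(depths, needles), reverse=True):
        index = int(depth * n)
        result.insert(index, needle)
    return result


# Example: two needles (n = 2) in a four-sentence haystack.
haystack = ["s0", "s1", "s2", "s3"]
context = insert_needles(haystack, ["N1", "N2"], [0.5, 1.0])
print(" ".join(context))
```

A prompt for the retrieval task ($n=1$) would then ask the model to recover the single inserted sentence, while the reasoning task ($n>1$) would ask a question whose answer depends on all inserted sentences.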

Paper Structure (19 sections, 11 figures, 2 tables)

Introduction
Related Work
Multilingual Needles in a Haystack for Retrieval and Reasoning Evaluation
Languages and Models
Retrieval and Reasoning Tasks
Retrieving a Needle ($n=1$)
Reasoning over Multiple Needles ($n>1$)
Creating mLongRR Dataset
Prompts
Experiments
Evaluation
Results and Discussion
Performance of different models across languages and tasks
Performance across varying needle depths and haystack lengths
Performance across different languages
...and 4 more sections

Figures (11)

Figure 1: Ablation results comparing Paul Graham's essays and news articles as haystacks for English experiments, tested using the GPT-4 model.
Figure 2: Ablation results comparing two different prompts.
Figure 3: Radar plots showing the performance of six language models (GPT-4, Gemini-1.5, Claude-3, Yarn-7b, Llama-3, GPT-4o) across five languages (English, Vietnamese, Indonesian, Swahili, Somali) in retrieval and reasoning tasks involving one, two, and three target sentences ("needles"). The three plots represent different task complexities: single needle retrieval ($n = 1$, left plot), two needle reasoning ($n = 2$, center plot), and three needle reasoning ($n = 3$, right plot).
Figure 4: Heatmap visualizations with varying depths on the $y$-axis and context lengths on the $x$-axis, showing average model performance over all the languages for both retrieval (top panel) and reasoning tasks (middle and bottom panels). The color gradient from white to dark green represents accuracy levels, with darker green indicating higher accuracy.
Figure 5: Language-specific heatmap visualizations with varying depths on the $y$-axis and context lengths on the $x$-axis, averaged over all the models, when $n=1$, $n=2$, and $n=3$.
...and 6 more figures
