Can LLMs reason over extended multilingual contexts? Towards long-context evaluation beyond retrieval and haystacks

Amey Hengle, Prasoon Bajpai, Soham Dan, Tanmoy Chakraborty

TL;DR

This work introduces MLRBench, a synthetic, multilingual long-context benchmark designed to evaluate reasoning beyond surface-level retrieval across seven languages. It demonstrates that LLMs struggle with long-context reasoning, with performance degrading as linguistic distance from English increases and with a sizable gap between retrieval and multi-step reasoning tasks. The study shows that effective context use is limited (roughly $25\%$–$30\%$ of advertised context), and while Retrieval Augmented Generation helps, it does not fully resolve long-context reasoning challenges. By open-sourcing MLRBench, the authors provide a valuable resource to spur improved evaluation and training of multilingual LLMs for extended-context tasks and to guide future research toward truly robust multilingual long-context reasoning.

Abstract

Existing multilingual long-context benchmarks, often based on the popular needle-in-a-haystack test, primarily evaluate a model's ability to locate specific information buried within irrelevant texts. However, such a retrieval-centric approach is myopic and inherently limited, as successful recall alone does not indicate a model's capacity to reason over extended contexts. Moreover, these benchmarks are susceptible to data leakage and short-circuiting, and they risk making the evaluation a priori identifiable. To address these limitations, we introduce MLRBench, a new synthetic benchmark for multilingual long-context reasoning. Unlike existing benchmarks, MLRBench goes beyond surface-level retrieval by including tasks that assess multi-hop inference, aggregation, and epistemic reasoning. Spanning seven languages, MLRBench is designed to be parallel, resistant to leakage, and scalable to arbitrary context lengths. Our extensive experiments with an open-weight large language model (LLM) reveal a pronounced gap between high- and low-resource languages, particularly for tasks requiring the model to aggregate multiple facts or predict the absence of information. We also find that, in multilingual settings, LLMs effectively utilize less than 30% of their claimed context length. Although off-the-shelf Retrieval Augmented Generation helps alleviate this to a certain extent, it does not solve the long-context problem. We open-source MLRBench to enable future research in improved evaluation and training of multilingual LLMs.

Paper Structure

This paper contains 15 sections, 1 equation, 13 figures, 9 tables.

Figures (13)

  • Figure 7: Performance comparison across different languages: (a) Performance comparison of prompt-based and RAG methods for the best-performing language (en), the second-best-performing language (es), and the worst-performing language (zh). (b) Performance comparison of prompt-based and RAG methods for different languages at a short context length ($4k$) and at longer context lengths ($\geq 32k$). (c) Performance of two RAG methods at a short context length ($4k$) and at longer context lengths ($\geq 32k$).
  • Figure 8: Task-wise performance of Llama-3.1-Instruct on selected languages. We group the tasks into four categories, as discussed in the section on task categories.
  • Figure 9: Performance comparison between the best RAG method and the best prompt-only method for (a) short context lengths ($4k$–$16k$) and (b) long context lengths ($\geq 32k$).
  • Figure 10: Ablation studies: (a) Performance of two different RAG methods with different numbers of retrieved documents. (b) Effect of temperature scaling at 0k context size and 8k context size. (c) Performance across different context sizes is affected by the type of distractors used to increase the context size.
  • Figure 11: Exact accuracy of Llama-3.1-Instruct as the sample size of the evaluation (test) set is varied. Solid lines represent accuracy, while shaded areas indicate the standard error. Legend (by color): en (blue), de (orange), es (green), zh (red), vi (purple), hi (brown), ar (pink).
  • ...and 8 more figures