Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks

Chonghua Wang, Haodong Duan, Songyang Zhang, Dahua Lin, Kai Chen

TL;DR

Ada-LEval introduces a length-adaptable benchmark for evaluating ultra-long-context understanding in LLMs via two tasks, TSort and BestAnswer. It systematically varies test-case lengths (including 32k–128k tokens) and uses data from Project Gutenberg and Stack Overflow, enabling precise accuracy-based evaluation and ablation studies. The results reveal that current proprietary models still struggle in ultra-long contexts, open-source LLMs lag behind, and even advanced position-embedding approaches only partly mitigate the limitations. The work provides the first structured ultra-long-context evaluation framework and benchmarks to guide future development of long-context LLMs.

Abstract

Recently, the large language model (LLM) community has shown increasing interest in enhancing LLMs' capability to handle extremely long documents. As various long-text techniques and model architectures emerge, the precise and detailed evaluation of models' long-text capabilities has become increasingly important. Existing long-text evaluation benchmarks, such as L-Eval and LongBench, construct long-text test sets based on open-source datasets, focusing mainly on QA and summarization tasks. These datasets mix test samples of varying lengths (from 2k to 32k+ tokens), making it challenging to assess model capabilities across different length ranges. Moreover, they do not cover the ultra-long settings (100k+ tokens) that the latest LLMs claim to support. In this paper, we introduce Ada-LEval, a length-adaptable benchmark for evaluating the long-context understanding of LLMs. Ada-LEval includes two challenging subsets, TSort and BestAnswer, which enable a more reliable evaluation of LLMs' long-context capabilities. These benchmarks support fine-grained control over the length of test cases, and can easily produce text samples of up to 128k tokens. We evaluate 4 state-of-the-art closed-source API models and 6 open-source models with Ada-LEval. The evaluation results demonstrate the limitations of current LLMs, especially in ultra-long-context settings. Our code is available at https://github.com/open-compass/Ada-LEval.
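
To make the idea of a length-adaptable, accuracy-scored test case concrete, the following is a minimal sketch of a TSort-style sample: a text is cut into segments, the segments are shuffled, and a prediction counts as correct only if it recovers the full original order. This is an illustration under assumptions, not the authors' implementation: the function names `make_tsort_case` and `exact_match_accuracy`, the `max_words` length control, and the use of whitespace-separated words as a stand-in for tokens are all hypothetical. The official code is in the repository linked above.

```python
import random


def make_tsort_case(text, num_segments=4, max_words=None, seed=0):
    """Build one TSort-style case (hypothetical helper, not the paper's code).

    Returns the shuffled segments and the answer: for each original segment,
    the 1-based position it occupies in the shuffled list.
    """
    words = text.split()
    if max_words is not None:
        # Crude length control: whitespace words stand in for tokens here.
        words = words[:max_words]
    if len(words) < num_segments:
        raise ValueError("text too short for the requested number of segments")

    size = len(words) // num_segments
    segments = []
    for i in range(num_segments):
        start = i * size
        # The last segment absorbs any remainder words.
        end = (i + 1) * size if i < num_segments - 1 else len(words)
        segments.append(" ".join(words[start:end]))

    order = list(range(num_segments))
    random.Random(seed).shuffle(order)  # seeded for reproducible cases
    shuffled = [segments[i] for i in order]
    answer = [order.index(k) + 1 for k in range(num_segments)]
    return shuffled, answer


def exact_match_accuracy(predictions, answers):
    """Fraction of cases whose predicted order matches the answer exactly."""
    if not answers:
        return 0.0
    hits = sum(1 for p, a in zip(predictions, answers) if list(p) == list(a))
    return hits / len(answers)
```

Raising `max_words` (or, in a BestAnswer-style setting, the number of candidate answers) lengthens the case without changing the task or the metric, which is the property that lets results be compared across length ranges.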

Paper Structure (23 sections, 2 figures, 13 tables)

Figures (2)

  • Figure 1: An illustration of the two tasks introduced in Ada-LEval, TSort and BestAnswer. Solving either task requires understanding and reasoning over the full text.
  • Figure 2: The instruction-following rate of LLMs on TSort (left) and BestAnswer (right) under long-context settings. GPT-4-Turbo on TSort and all proprietary models on BestAnswer achieve a 100% instruction-following rate across all long-context settings and are therefore not displayed.