Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models

Mingyang Song, Mao Zheng, Xuan Luo

TL;DR

This paper introduces Counting-Stars, a scalable, position-aware benchmark for evaluating long-context LLMs on two counting-based sub-tasks: multi-evidence searching and multi-evidence reasoning. By varying context length and the amount of inserted evidence, Counting-Stars assesses how well models retrieve and reason over dispersed information, revealing a length-stability dynamic and limited support for the lost-in-the-middle effect. Experimental results across several leading LLMs show Gemini 1.5 Pro achieving the best overall performance, with GPT-4 Turbo offering the most stable results across tasks, while all models degrade as context grows. The work highlights room for improvement in long-context capabilities and outlines future directions to broaden model coverage and task complexity.

Abstract

Despite recent efforts to develop large language models with robust long-context capabilities, the lack of long-context benchmarks means that relatively little is known about their performance. To address this gap, we propose Counting-Stars, a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs. Counting-Stars comprises two counting-based multi-evidence retrieval sub-tasks: searching and reasoning. Using Counting-Stars, we conduct experiments to evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across various tasks. Furthermore, our analysis of these LLMs, which have been extended to handle long-context scenarios, indicates that significant room for improvement remains as the length of the input context and the complexity of the tasks increase.
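The multi-evidence searching setup described above can be pictured with a small sketch. This is an illustration under stated assumptions, not the authors' implementation: the evidence template, the filler text, the even spacing rule, and the `score` function are all hypothetical, chosen only to show how several pieces of counting evidence might be scattered through a long context and then checked against a model's answer.

```python
import random


def scatter_evidence(filler_sentences, evidence_sentences):
    """Insert one evidence sentence after each equal-sized chunk of filler.

    Hypothetical placement rule: the paper's actual positions may differ.
    """
    n = len(evidence_sentences)
    step = len(filler_sentences) // n
    out = []
    for i in range(n):
        out.extend(filler_sentences[i * step:(i + 1) * step])
        out.append(evidence_sentences[i])
    # Append any filler left over after the last evidence sentence.
    out.extend(filler_sentences[n * step:])
    return out


def build_sample(num_evidence, num_filler, seed=0):
    """Build one synthetic context plus the list of counts hidden in it."""
    rng = random.Random(seed)
    counts = rng.sample(range(1, 100), num_evidence)
    # Assumed evidence template, for illustration only.
    evidence = [f"The little penguin counted {c} stars." for c in counts]
    filler = [f"Filler sentence {i}." for i in range(num_filler)]
    context = " ".join(scatter_evidence(filler, evidence))
    return context, counts


def score(predicted, expected):
    """Fraction of the expected counts that appear in the prediction."""
    return len(set(predicted) & set(expected)) / len(expected)
```

Scaling `num_filler` changes the context length and scaling `num_evidence` changes how many pieces of evidence must be retrieved, which is the sense in which such a benchmark is scalable. A model's reply would be parsed into a list of integers and passed to `score` together with the returned `counts`.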

Paper Structure

This paper contains 14 sections, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Illustration of how stars are scattered into a long context of length 96K.
  • Figure 2: Visualization of the results on the Chinese version of the Counting-Stars-32-(Multi-evidence Searching).
  • Figure 3: Visualization of the results on the Chinese version of the Counting-Stars-32-(Multi-evidence Reasoning).