Counting-Stars: A Multi-evidence, Position-aware, and Scalable Benchmark for Evaluating Long-Context Large Language Models

Mingyang Song; Mao Zheng; Xuan Luo

TL;DR

This paper introduces Counting-Stars, a scalable, position-aware benchmark for evaluating long-context LLMs on two counting-based sub-tasks: multi-evidence searching and multi-evidence reasoning. By varying context length and the amount of inserted evidence, Counting-Stars assesses how well models retrieve and reason over dispersed information, revealing a length-stability dynamic and limited support for the lost-in-the-middle effect. Experimental results across several leading LLMs show Gemini 1.5 Pro achieving the best overall performance, with GPT-4 Turbo offering the most stable results across tasks, while all models degrade as context grows. The work highlights room for improvement in long-context capabilities and outlines future directions to broaden model coverage and task complexity.

Abstract

Despite recent efforts to develop large language models with robust long-context capabilities, the lack of long-context benchmarks means that relatively little is known about their performance. To address this gap, we propose Counting-Stars, a multi-evidence, position-aware, and scalable benchmark designed to evaluate the multi-evidence retrieval capabilities of long-context LLMs. Counting-Stars comprises two counting-based multi-evidence retrieval sub-tasks: searching and reasoning. Using Counting-Stars, we evaluate several long-context LLMs, including GPT-4 Turbo, Gemini 1.5 Pro, Claude3 Opus, GLM-4, and Moonshot-v1. Extensive experimental results demonstrate that Gemini 1.5 Pro achieves the best overall results, while GPT-4 Turbo exhibits the most stable performance across tasks. Furthermore, our analysis of these LLMs, all of which have been extended to handle long-context scenarios, indicates that significant room for improvement remains as the input context grows longer and the tasks become more complex.
