
LongGenBench: Long-context Generation Benchmark

Xiang Liu, Peijie Dong, Xuming Hu, Xiaowen Chu

TL;DR

LongGenBench is a synthetic benchmark with configurable generation context lengths. It advances beyond traditional benchmarks by redesigning the question format so that an LLM must respond with a single, cohesive long-context answer.

Abstract

Current long-context benchmarks, such as the needle-in-a-haystack (NIAH) benchmark, primarily focus on retrieval-based tests that require Large Language Models (LLMs) to locate specific information within extensive input contexts. Long-context generation refers to the ability of a language model to generate coherent and contextually accurate text that spans lengthy passages or documents. While recent studies show strong performance on NIAH and other retrieval-based long-context benchmarks, there is a significant lack of benchmarks for evaluating long-context generation capabilities. To bridge this gap and offer a comprehensive assessment, we introduce a synthetic benchmark, LongGenBench, which allows for flexible configurations of customized generation context lengths. LongGenBench advances beyond traditional benchmarks by redesigning the format of questions and requiring LLMs to respond with a single, cohesive long-context answer. Upon extensive evaluation using LongGenBench, we observe that: (1) both API-accessed and open-source models exhibit performance degradation in long-context generation scenarios, ranging from 1.2% to 47.1%; (2) different series of LLMs exhibit varying trends of performance degradation, with the Gemini-1.5-Flash model showing the least degradation among API-accessed models, and the Qwen2 series showing the least degradation among open-source models.
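
The question redesign described in the abstract can be pictured with a minimal sketch: several questions are joined into one prompt, the model answers them sequentially in a single response, and the response is split back into per-question answers for scoring. The prompt wording, the `Question_i`/`Answer_i` markers, and the function names `build_long_gen_prompt` and `split_answers` are assumptions made for illustration; they are not taken from the paper or its released code.

```python
import re


def build_long_gen_prompt(questions):
    """Join K questions into one prompt that asks for K answers in order.

    The template below is hypothetical, not the authors' actual prompt.
    """
    header = (
        f"Answer the following {len(questions)} questions one by one, in order. "
        "Begin each answer with 'Answer_i:' where i is the question number.\n\n"
    )
    body = "\n".join(
        f"Question_{i}: {q}" for i, q in enumerate(questions, start=1)
    )
    return header + body


def split_answers(response, num_questions):
    """Recover per-question answers from a single long response."""
    # Capture the text after each 'Answer_i:' marker, up to the next marker
    # or the end of the response.
    pattern = re.compile(r"Answer_(\d+):\s*(.*?)(?=Answer_\d+:|\Z)", re.DOTALL)
    found = {int(i): text.strip() for i, text in pattern.findall(response)}
    # A question the model skipped is returned as an empty string.
    return [found.get(i, "") for i in range(1, num_questions + 1)]


if __name__ == "__main__":
    questions = ["What is 2 + 3?", "What is 10 - 4?"]
    print(build_long_gen_prompt(questions))
    print(split_answers("Answer_1: 5\nAnswer_2: 6", num_questions=3))
```

Under these assumptions, increasing the number of questions per prompt lengthens the required response, which is one way to read the "flexible configurations of customized generation context lengths" mentioned above.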

Paper Structure

This paper contains 35 sections, 1 equation, 8 figures, 16 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Performance comparison of LLMs on the GSM8K and MMLU datasets, using LongGenBench to assess their long-context generation capabilities. Mainstream LLMs exhibit performance degradation when tasked with long-context generation.
  • Figure 2: Illustrations of previous long-context benchmarks and our proposed approach. (a) Retrieval task: requires LLMs to retrieve the magic information hidden within an unrelated long context. (b) Understanding task: requires LLMs to comprehensively understand a long essay and answer the specific question. (c) Our approach: reconstructs the format of the dataset, requiring LLMs to sequentially understand and respond to each question in a single response. We run multiple iterations with different questions to evaluate the robustness of long-context generation capabilities. The length of the generated responses aims to approach the token limit.
  • Figure 3: Generation accuracy distribution of API-accessed models in LongGenBench-GSM8K.
  • Figure 4: Generation accuracy distribution of open-source models in LongGenBench-GSM8K.
  • Figure 5: Output length distribution in LongGenBench-GSM8K.
  • ...and 3 more figures