LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs

Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee

TL;DR

LongGenBench identifies a critical gap in existing benchmarks by focusing on the generation capabilities of long-context LLMs, not just retrieval or reasoning over long inputs. The paper defines a comprehensive evaluation framework with four real-world scenarios and three instruction types, assessing generation up to 32K tokens via CR, STIC-1, and STIC-2 metrics. Through extensive experiments on ten diverse LLMs, it shows that current models struggle to maintain instruction adherence and content coherence over long outputs, with performance deteriorating as length increases. The findings highlight key challenges in long-form generation and motivate future work on model architectures and instruction-tuning data that support sustained, coherent, instruction-following text over extended sequences, with significant implications for design proposals, technical documentation, and creative writing.

Abstract

Current benchmarks like Needle-in-a-Haystack (NIAH), Ruler, and Needlebench focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences, a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce LongGenBench, a novel benchmark designed to rigorously evaluate the ability of large language models (LLMs) to generate long text while adhering to complex instructions. Through tasks requiring specific events or constraints within generated text, LongGenBench evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on Ruler, all models struggled with long text generation on LongGenBench, particularly as text length increased. This suggests that current LLMs are not yet equipped to meet the demands of real-world, long-form text generation.

Paper Structure

This paper contains 41 sections, 4 equations, 4 figures, 7 tables, 1 algorithm.

Figures (4)

  • Figure 1: LongGenBench Overview: 1) Scenario Selection: Select from four scenarios—Diary, Menu Design, Skyscraper Design, and Urban Planning—each offered in both short and long versions to determine the main task prompt. 2) Task Instruction: Employ the template libraries SI (Single), RI (Range), and PI (Periodic) to generate tasks characterized by random times or locations, along with the corresponding prompts and verification sets. 3) Instruction Synthesis: Integrate all prompts generated in the prior step to create a comprehensive set of instructions with a final check-set. 4) Example: An illustration of Sophia's weekly diary task is provided as an example.
  • Figure 2: The right side of the figure illustrates the model's performance on specific instruction tasks at 16K as sequence length increases, whereas the left side depicts performance at 32K. All curves have been smoothed with a Moving Average.
  • Figure 3: Performance comparison across three task settings
  • Figure 4: Performance Comparison on Ruler and LongGenBench Tasks
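
The pipeline described in the Figure 1 caption (pick a scenario, generate SI/RI/PI instructions at random times, then merge them into one prompt with a check-set) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_instructions`, the 52-week horizon, the event strings, the 4-week range and period, and the prompt wording are all assumptions chosen for illustration. Only the three instruction types and the "Sophia's weekly diary" scenario come from the caption.

```python
import random


def build_instructions(num_weeks=52, seed=0):
    """Hypothetical sketch: build SI/RI/PI instructions plus a check-set.

    Returns (prompt, check_set), where check_set maps a week number to the
    list of events the generated diary entry for that week must mention.
    """
    rng = random.Random(seed)
    check_set = {}
    prompts = []

    # Single Instruction (SI): one event at one randomly chosen week.
    week = rng.randint(1, num_weeks)
    check_set.setdefault(week, []).append("birthday party")
    prompts.append(f"In week {week}, mention a birthday party.")

    # Range Instruction (RI): one event across a contiguous span of weeks.
    start = rng.randint(1, num_weeks - 3)
    end = start + 3
    for w in range(start, end + 1):
        check_set.setdefault(w, []).append("beach vacation")
    prompts.append(f"In weeks {start}-{end}, mention a beach vacation.")

    # Periodic Instruction (PI): one event every `period` weeks.
    period = 4
    first = rng.randint(1, period)
    for w in range(first, num_weeks + 1, period):
        check_set.setdefault(w, []).append("running practice")
    prompts.append(
        f"Every {period} weeks starting from week {first}, mention running practice."
    )

    # Instruction synthesis: main task prompt followed by all sub-instructions.
    main_prompt = f"Write a weekly diary for Sophia covering {num_weeks} weeks."
    return main_prompt + " " + " ".join(prompts), check_set
```

Under these assumptions, the check-set is what an evaluator would later compare against each generated weekly entry to decide whether the corresponding instruction was followed.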