LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
TL;DR
LongGenBench addresses a gap in existing benchmarks by evaluating the generation capabilities of long-context LLMs, not just retrieval or reasoning over long inputs. The paper defines an evaluation framework with four real-world scenarios and three instruction types, assessing generation of up to 32K tokens with three metrics: CR, STIC-1, and STIC-2. Experiments on ten LLMs show that current models struggle to maintain instruction adherence and content coherence over long outputs, with performance deteriorating as output length increases. The findings motivate future work on model architectures and instruction-tuning data that support sustained, coherent, instruction-following text over extended sequences, which matters for applications such as design proposals, technical documentation, and creative writing.
Abstract
Current benchmarks like Needle-in-a-Haystack (NIAH), Ruler, and Needlebench focus on models' ability to understand long-context input sequences but fail to capture a critical dimension: the generation of high-quality long-form text. Applications such as design proposals, technical documentation, and creative writing rely on coherent, instruction-following outputs over extended sequences, a challenge that existing benchmarks do not adequately address. To fill this gap, we introduce LongGenBench, a novel benchmark designed to rigorously evaluate large language models' (LLMs) ability to generate long text while adhering to complex instructions. Through tasks requiring specific events or constraints within generated text, LongGenBench evaluates model performance across four distinct scenarios, three instruction types, and two generation lengths (16K and 32K tokens). Our evaluation of ten state-of-the-art LLMs reveals that, despite strong results on Ruler, all models struggle with long text generation on LongGenBench, particularly as text length increases. This suggests that current LLMs are not yet equipped to meet the demands of real-world, long-form text generation.
