LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm

Siwei Wu, Yizhi Li, Xingwei Qu, Rishi Ravikumar, Yucheng Li, Tyler Loakman, Shanghaoran Quan, Xiaoyong Wei, Riza Batista-Navarro, Chenghua Lin

TL;DR

LongEval tackles the underexplored problem of evaluating long-form text generated by LLMs by introducing a dual-paradigm benchmark that combines direct and plan-based generation. It presents eight document- and section-level metrics across three domains (arXiv, blogs, Wikipedia) and demonstrates that plan-based generation yields higher-quality long-form outputs than direct generation, especially in maintaining coherence and following content plans. A content-plan generation pipeline using Qwen2.5-72B-Instruct, along with human validation and an information-compression metric $\text{ICR}$, enables robust evaluation with approximately 150–166 long texts exceeding 2K words. The findings show that model size helps but does not solve the challenges of length control and high-level reasoning, highlighting the practical value of planning-based generation for future long-text systems.

Abstract

Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, yet their ability to generate long-form content remains poorly understood and evaluated. Our analysis reveals that current LLMs struggle with length requirements and information density in long-text generation, with performance deteriorating as text length increases. To quantitatively locate this performance degradation and provide further insights for model development, we present LongEval, a benchmark that evaluates long-text generation through both direct and plan-based generation paradigms, inspired by cognitive and linguistic writing models. The comprehensive experiments in this work reveal interesting findings: for example, while model size correlates with generation ability, a small-scale model that is well trained on long texts (e.g., LongWriter) achieves comparable performance. All code and datasets are released at https://github.com/Wusiwei0410/LongEval.

Paper Structure

This paper contains 40 sections, 6 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: The information content of LLM-generated text and the golden human-authored text. We calculate information entropy using the frequency of each word in a document and determine the information content by multiplying the total word count by the information entropy.
  • Figure 2: The relation between the length requirement and the model-generated text length. Given the content plans, we require the LLMs to generate text under various length requirements ranging from 100 to 32k. Specifically, we use the ratio of the generated text length to the requested length in the input as a score to evaluate the model's ability to follow length instructions.
  • Figure 3: The Framework of our Long Text Generation method. Part (a) is the Plan-based method and part (b) is the Long Text Evaluation method.
  • Figure 4: The table presents the prompts for the metrics that use LLMs to evaluate long text under different dimensions.
  • Figure 5: A section generated by Qwen2.5-72B.
  • ...and 1 more figures
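
The two quantities described in the captions of Figure 1 and Figure 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the whitespace tokenization, lowercasing, base-2 logarithm, and word-based length count are all assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter


def word_entropy(text: str) -> float:
    """Shannon entropy (bits per word) of the word-frequency distribution.

    Assumption: words are lowercased and split on whitespace.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    total = len(words)
    # H = -sum(p * log2(p)) over the distinct words in the document.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_content(text: str) -> float:
    """Total word count multiplied by word entropy (Figure 1 caption)."""
    return len(text.split()) * word_entropy(text)


def length_following_score(generated: str, requested_words: int) -> float:
    """Ratio of generated length to requested length (Figure 2 caption).

    Assumption: length is measured in whitespace-separated words.
    """
    if requested_words <= 0:
        raise ValueError("requested_words must be positive")
    return len(generated.split()) / requested_words
```

Under these assumptions, a length-following score of 1.0 means the output matched the requested length exactly, while values below 1.0 indicate that the model wrote less than it was asked to.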