LongEval: A Comprehensive Analysis of Long-Text Generation Through a Plan-based Paradigm

Siwei Wu; Yizhi Li; Xingwei Qu; Rishi Ravikumar; Yucheng Li; Tyler Loakman; Shanghaoran Quan; Xiaoyong Wei; Riza Batista-Navarro; Chenghua Lin

TL;DR

LongEval tackles the underexplored problem of evaluating long-form text generated by LLMs by introducing a dual-paradigm benchmark that combines direct and plan-based generation. It presents eight document- and section-level metrics across three domains (arXiv, blogs, Wikipedia) and demonstrates that plan-based generation yields higher-quality long-form outputs than direct generation, especially in maintaining coherence and following content plans. A content-plan generation pipeline using Qwen2.5-72B-Instruct, along with human validation and an information-compression metric $\text{ICR}$, enables robust evaluation with approximately 150–166 long texts exceeding 2K words. The findings show that model size helps but does not solve the challenges of length control and high-level reasoning, highlighting the practical value of planning-based generation for future long-text systems.

Abstract

Large Language Models (LLMs) have achieved remarkable success in various natural language processing tasks, yet their ability to generate long-form content remains poorly understood and evaluated. Our analysis reveals that current LLMs struggle with length requirements and information density in long-text generation, with performance deteriorating as text length increases. To quantitatively locate this performance degradation and provide further insights for model development, we present LongEval, a benchmark that evaluates long-text generation through both direct and plan-based generation paradigms, inspired by cognitive and linguistic writing models. Our comprehensive experiments reveal, among other findings, that while model size correlates with generation ability, a small-scale model well trained on long texts (e.g., LongWriter) achieves comparable performance. All code and datasets are released at https://github.com/Wusiwei0410/LongEval.
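
To make the contrast between the two paradigms concrete, the following is a minimal sketch, not the authors' pipeline. It assumes a hypothetical `generate` callable that maps a prompt string to generated text; the prompt wording, the function names, and the simple word-count `length_ratio` are all illustrative assumptions and are not taken from the LongEval code.

```python
from typing import Callable, List

# Hypothetical model interface: any function mapping a prompt to generated text.
Generator = Callable[[str], str]


def direct_generation(generate: Generator, topic: str, target_words: int) -> str:
    """Direct paradigm: request the whole document in a single prompt."""
    prompt = f"Write a document of about {target_words} words on: {topic}"
    return generate(prompt)


def plan_based_generation(
    generate: Generator, topic: str, plan: List[str], words_per_section: int
) -> str:
    """Plan-based paradigm: write one section per content-plan item,
    conditioning each section on the sections written so far."""
    sections: List[str] = []
    for point in plan:
        context = "\n\n".join(sections)
        prompt = (
            f"Topic: {topic}\n"
            f"Text so far:\n{context}\n"
            f"Write about {words_per_section} words covering: {point}"
        )
        sections.append(generate(prompt))
    return "\n\n".join(sections)


def length_ratio(text: str, target_words: int) -> float:
    """Illustrative length-control check: produced words / requested words."""
    return len(text.split()) / target_words
```

A ratio near 1.0 would indicate that the requested length was met. Splitting the task by plan item lets the length target be enforced per section rather than over the whole document, which is one plausible reason a plan-based setup can be easier to control.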
