LongWeave: A Long-Form Generation Benchmark Bridging Real-World Relevance and Verifiability

Zikai Xiao, Fei Huang, Jianhong Tu, Jianhui Wei, Wen Ma, Yuxuan Zhou, Jian Wu, Bowen Yu, Zuozhu Liu, Junyang Lin

TL;DR

LongWeave tackles the challenge of reliably evaluating long-form generation under realistic constraints. It introduces Constraint-Verifier Evaluation (CoV-Eval), a framework that constructs tasks with verifiable targets and associated materials, constraints, and verifiers to enable objective scoring of long outputs. Evaluated over seven tasks and 64K-input/8K-output scales across 23 LLMs, the benchmark reveals substantial degradation in performance as length grows, with reasoning-oriented systems handling longer tasks better but still facing termination and verification issues. By providing a scalable, verifiable diagnostic platform, LongWeave offers a practical path to diagnosing and improving long-form generation and its evaluation in real-world contexts.

Abstract

Generating long, informative, and factual outputs remains a major challenge for Large Language Models (LLMs). Existing benchmarks for long-form generation typically assess real-world queries with hard-to-verify metrics or use synthetic setups that ease evaluation but overlook real-world intricacies. In this paper, we introduce LongWeave, which balances real-world and verifiable assessment with Constraint-Verifier Evaluation (CoV-Eval). CoV-Eval constructs tasks by first defining verifiable targets within real-world scenarios, then systematically generating corresponding queries, textual materials, and constraints based on these targets. This ensures that tasks are both realistic and objectively assessable, enabling rigorous assessment of model capabilities in meeting complex real-world constraints. LongWeave supports customizable input/output lengths (up to 64K/8K tokens) across seven distinct tasks. Evaluation on 23 LLMs shows that even state-of-the-art models encounter significant challenges in long-form generation as real-world complexity and output length increase.
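The target-first construction described in the abstract can be illustrated with a small sketch. This is not the paper's implementation: the names `ConstraintVerifier`, `build_task`, and `score`, the substring-based check, and the example targets are all hypothetical, chosen only to show how pairing each constraint with a programmatic verifier could yield an objective score for a long output.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ConstraintVerifier:
    """Hypothetical pairing of a constraint (shown to the model) with a check."""
    constraint: str
    verifier: Callable[[str], bool]


def build_task(targets: List[str]) -> List[ConstraintVerifier]:
    # Start from the verifiable targets, then derive one constraint and one
    # verifier per target. A substring match is an assumed, simplistic check.
    return [
        ConstraintVerifier(
            constraint=f"The output must mention '{t}'.",
            verifier=lambda out, t=t: t.lower() in out.lower(),
        )
        for t in targets
    ]


def score(output: str, pairs: List[ConstraintVerifier]) -> float:
    # Fraction of constraints whose verifier accepts the output.
    if not pairs:
        return 0.0
    passed = sum(1 for p in pairs if p.verifier(output))
    return passed / len(pairs)


# Illustrative targets and model output (made up for this sketch).
targets = ["Q3 revenue", "headcount", "churn rate"]
pairs = build_task(targets)
output = "Q3 revenue rose while headcount stayed flat."
result = score(output, pairs)  # two of the three constraints are satisfied
```

Because the verifiers are derived from the same targets as the constraints, the score needs no human or model judge; a real benchmark would presumably use far richer checks than a substring match.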

Paper Structure

This paper contains 27 sections, 4 equations, 25 figures, 9 tables.

Figures (25)

  • Figure 1: The performance across the seven tasks in LongWeave. For better visualization, performance scores have been normalized to a range of 0.3 to 0.7.
  • Figure 2: Three evaluation paradigms for long-form generation. LongWeave is grounded in real-world scenarios and based on objective, verifiable scoring with built-in ground truth, reducing subjectivity and inconsistencies.
  • Figure 3: Illustration of the LongWeave evaluation pipeline. Attribute seeds define task scenarios, and the task generator creates long-form generation tasks paired with constraint–verifier sets. Model outputs are then evaluated by matching against verifiers with length and instruction-following checks.
  • Figure 4: Input length distribution of LongWeave
  • Figure 5: Performance of different model sizes
  • ...and 20 more figures