WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario

Jiebin Zhang, Eugene J. Yu, Qinyu Chen, Chenhao Xiong, Dawei Zhu, Han Qian, Mingbo Song, Weimin Xiong, Xiaoguang Li, Qun Liu, Sujian Li

TL;DR

The paper tackles the challenge of generating full-length, verifiable Wikipedia articles for newly emerging events under realistic conditions, addressing the limitations of prior work that either targets short snippets or relies on inadequate metrics. It introduces WikiGenBench, a 1,320-entry benchmark with web-sourced reference documents and an evaluation framework covering writing, informativeness, and verifiability, enabling systematic comparison of Retrieval-Augmented Generation (RAG) methods. In experiments with RR, PRR, and TunedRR across multiple LLMs and rerankers, the study finds that hierarchical, planning-based generation (PRR) yields more comprehensive content, while fine-tuning improves verifiability. Even the best methods, however, still lag behind human-written Wikipedia articles in overall quality and citation reliability. The work highlights the promise of combining retrieval techniques with LLMs for credible, long-form Wikipedia generation and points to future directions in data quality, advanced retrieval and reranking, and post-citation verification to narrow the performance gap.

Abstract

Generating comprehensive and accurate Wikipedia articles for newly emerging events in a real-world scenario presents significant challenges. Existing attempts fall short either by focusing only on short snippets or by using metrics that are insufficient to evaluate real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries, designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario in which structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments using various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a fine-tuned generation model. Experimental results show that hierarchical methods can generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.

Paper Structure (28 sections, 4 equations, 2 figures, 10 tables)

Figures (2)

  • Figure 1: Illustration of the proposed Wikipedia generation task.
  • Figure 2: We fine-tuned models of various scales and families, evaluating checkpoints at 1, 5, and 10 epochs. We selected one primary metric from each of the three dimensions and displayed their performance trends with training epochs.