WIKIGENBENCH: Exploring Full-length Wikipedia Generation under Real-World Scenario
Jiebin Zhang, Eugene J. Yu, Qinyu Chen, Chenhao Xiong, Dawei Zhu, Han Qian, Mingbo Song, Weimin Xiong, Xiaoguang Li, Qun Liu, Sujian Li
TL;DR
The paper tackles the challenge of generating full-length, verifiable Wikipedia articles for newly emerging events under realistic conditions, addressing the limitations of prior work that either targets short snippets or relies on inadequate metrics. It introduces WikiGenBench, a 1,320-entry benchmark with web-sourced reference documents and an evaluation framework covering writing, informativeness, and verifiability, which enables systematic comparison of Retrieval-Augmented Generation (RAG) methods. Experiments with three frameworks, RR (direct RAG), PRR (hierarchical, planning-based RAG), and TunedRR (RAG with a fine-tuned generation model), across multiple LLMs and rerankers show that PRR yields more comprehensive content, while fine-tuning improves verifiability. Even the best methods, however, still lag behind human-written Wikipedia articles in overall quality and citation reliability. The work highlights the promise of combining retrieval techniques with LLMs for credible, long-form Wikipedia generation and points to future directions in data quality, advanced retrieval and reranking, and post-citation verification to narrow the performance gap.
Abstract
Generating comprehensive and accurate Wikipedia articles for newly emerging events in a real-world scenario presents significant challenges. Existing attempts fall short either by focusing only on short snippets or by using metrics that are insufficient to evaluate real-world scenarios. In this paper, we construct WIKIGENBENCH, a new benchmark consisting of 1,320 entries, designed to align with real-world scenarios in both generation and evaluation. For generation, we explore a real-world scenario in which structured, full-length Wikipedia articles with citations are generated for new events using input documents from web sources. For evaluation, we integrate systematic metrics and LLM-based metrics to assess verifiability, organization, and other aspects aligned with real-world scenarios. Based on this benchmark, we conduct extensive experiments using various models within three commonly used frameworks: direct RAG, hierarchical structure-based RAG, and RAG with a fine-tuned generation model. Experimental results show that hierarchical methods can generate more comprehensive content, while fine-tuned methods achieve better verifiability. However, even the best methods still show a significant gap compared to existing Wikipedia content, indicating that further research is necessary.
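To make the difference between direct RAG and hierarchical structure-based RAG more concrete, the following is a minimal illustrative sketch, not the authors' implementation. Every name in it (`retrieve`, `format_context`, `direct_rag`, `hierarchical_rag`, and the `plan` and `generate` callables) is hypothetical, the word-overlap retriever is a toy stand-in for a real retriever or reranker, and the prompt wording and section markup are assumptions.

```python
from typing import Callable, List


def retrieve(query: str, docs: List[str], k: int = 2) -> List[int]:
    """Return indices of the k documents sharing the most words with the query.

    Toy stand-in (assumption) for a real retriever or reranker.
    """
    q = set(query.lower().split())
    ranked = sorted(
        range(len(docs)),
        key=lambda i: -len(q & set(docs[i].lower().split())),
    )
    return ranked[:k]


def format_context(docs: List[str], ids: List[int]) -> str:
    """Number each retrieved document so the generator can cite it as [n]."""
    return "\n".join(f"[{i + 1}] {docs[i]}" for i in ids)


def direct_rag(title: str, docs: List[str],
               generate: Callable[[str], str], k: int = 2) -> str:
    """Direct RAG: retrieve once for the title, then write the whole article."""
    ids = retrieve(title, docs, k)
    prompt = (f"Write a Wikipedia article about '{title}', citing sources "
              f"as [n].\n{format_context(docs, ids)}")
    return generate(prompt)


def hierarchical_rag(title: str, docs: List[str],
                     plan: Callable[[str], List[str]],
                     generate: Callable[[str], str], k: int = 2) -> str:
    """Hierarchical RAG: plan an outline, then retrieve and write per section."""
    sections = []
    for heading in plan(title):
        ids = retrieve(f"{title} {heading}", docs, k)
        prompt = (f"Write the '{heading}' section of a Wikipedia article "
                  f"about '{title}', citing sources as [n].\n"
                  f"{format_context(docs, ids)}")
        sections.append(f"== {heading} ==\n{generate(prompt)}")
    return "\n\n".join(sections)
```

In this sketch, `generate` and `plan` would wrap calls to an LLM. A fine-tuned variant would keep the `direct_rag` control flow and swap in a generator trained for the task.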
