REVerSum: A Multi-staged Retrieval-Augmented Generation Method to Enhance Wikipedia Tail Biographies through Personal Narratives

Sayantan Adak, Pauras Mangesh Meher, Paramita Das, Animesh Mukherjee

TL;DR

This work tackles the limited coverage of Wikipedia's tail biographies by leveraging personal narratives (autobiographies and biographies) within a multi-staged retrieval-augmented generation framework (REVerSum). It introduces a data collection pipeline covering 102 narratives and a four-stage generation process (relevance detection, evidence collection, verification, and summarization) grounded in verified sources, and it reports better integrability and informativeness than a standard RAG baseline on both B- and C-class biographies. The approach is validated through automatic metrics and crowd-based evaluations: REVerSum shows statistically significant gains (p<0.05) on many measures, favorable human judgments (92% integrable, 96% informative, 98% understandable, 99% readable), and a GPT-4 faithfulness score of 0.95. The paper also discusses generalization potential, limitations, and ethical considerations for incorporating personal narratives into knowledge bases, highlighting robustness against hallucinations via evidence verification and grounding.
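The four stages named above can be pictured as a simple filter-and-compose loop. The sketch below is only an illustration of that control flow, not the authors' implementation: the names `Stages`, `run_pipeline`, and the `toy_*` functions are hypothetical, and the toy stand-ins (keyword overlap, sentence splitting, substring checks, concatenation) take the place of the LLM calls a real system would presumably make at each stage.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Stages:
    """Hypothetical container for the four stage functions (assumed interface)."""
    is_relevant: Callable[[str, str], bool]             # (section, chunk) -> keep chunk?
    collect_evidence: Callable[[str, str], List[str]]   # (section, chunk) -> evidence
    is_verified: Callable[[str, str], bool]             # (evidence, chunk) -> grounded?
    summarize: Callable[[List[str]], str]               # verified evidence -> text


def run_pipeline(section: str, chunks: List[str], stages: Stages) -> str:
    """Run relevance detection, evidence collection, verification, summarization."""
    verified: List[str] = []
    for chunk in chunks:
        if not stages.is_relevant(section, chunk):
            continue  # stage 1: skip narrative chunks unrelated to the section
        for evidence in stages.collect_evidence(section, chunk):  # stage 2
            if stages.is_verified(evidence, chunk):  # stage 3: keep grounded evidence
                verified.append(evidence)
    return stages.summarize(verified)  # stage 4


# Toy stand-ins, for illustration only. They are NOT the paper's models.
def _words(text: str) -> set:
    return {w.strip(".,;:").lower() for w in text.split() if len(w) > 3}


def toy_relevant(section: str, chunk: str) -> bool:
    # Relevant if the chunk shares at least one longer word with the section title.
    return len(_words(section) & _words(chunk)) >= 1


def toy_evidence(section: str, chunk: str) -> List[str]:
    # Treat every sentence of a relevant chunk as candidate evidence.
    return [s.strip() for s in chunk.split(".") if s.strip()]


def toy_verified(evidence: str, chunk: str) -> bool:
    # Grounded if the evidence appears verbatim in the source chunk.
    return evidence in chunk


def toy_summarize(evidence: List[str]) -> str:
    # Concatenate verified evidence as sentences.
    return " ".join(e + "." for e in evidence)


TOY_STAGES = Stages(toy_relevant, toy_evidence, toy_verified, toy_summarize)
```

Keeping each stage behind a plain callable means the verification step can be swapped or tightened without touching the loop, which is the property the summary credits for robustness against hallucinations.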

Abstract

Wikipedia is an invaluable resource for factual information about a wide range of entities. However, the quality of articles on less-known entities often lags behind that of the well-known ones. This study proposes a novel approach to enhancing Wikipedia's B and C category biography articles by leveraging personal narratives such as autobiographies and biographies. By utilizing a multi-staged retrieval-augmented generation technique -- REVerSum -- we aim to enrich the informational content of these lesser-known articles. Our study reveals that personal narratives can significantly improve the quality of Wikipedia articles, providing a rich source of reliable information that has been underutilized in previous studies. Based on crowd-based evaluation, REVerSum-generated content outperforms the best-performing baseline by 17% in terms of integrability into the original Wikipedia article and by 28.5% in terms of informativeness. Code and data are available at: https://github.com/sayantan11995/wikipedia_enrichment

Paper Structure

This paper contains 37 sections, 1 equation, 5 figures, 15 tables.

Figures (5)

  • Figure 1: Overview of Wikipedia section enhancement from personal narratives.
  • Figure 2: A schematic of REVerSum. LLMs shown in the same block run in the same chat session.
  • Figure 3: Relevance of different portions of the personal narratives with respect to the Wikipedia section.
  • Figure 4: Interface showing the annotation task instructions.
  • Figure 5: Representative example of an annotation task.