Multi-News+: Cost-efficient Dataset Cleansing via LLM-based Data Annotation

Juhwan Choi; Jungmin Yun; Kyohoon Jin; YoungBin Kim

TL;DR

This study leverages approaches such as chain-of-thought and majority voting to imitate human annotation and identify unrelated documents in the Multi-News dataset, which is widely used for the multi-document summarization task.

Abstract

The quality of the dataset is crucial for ensuring optimal performance and reliability of downstream task models. However, datasets often contain noisy data inadvertently included during the construction process. Numerous attempts have been made to correct this issue through human annotators. However, hiring and managing human annotators is expensive and time-consuming. As an alternative, recent studies are exploring the use of large language models (LLMs) for data annotation. In this study, we present a case study that extends the application of LLM-based data annotation to enhance the quality of existing datasets through a cleansing strategy. Specifically, we leverage approaches such as chain-of-thought and majority voting to imitate human annotation and classify unrelated documents from the Multi-News dataset, which is widely used for the multi-document summarization task. Through our proposed cleansing method, we introduce an enhanced Multi-News+. By employing LLMs for data cleansing, we demonstrate an efficient and effective approach to improving dataset quality without relying on expensive human annotation efforts.
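
To make the majority-voting idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes each annotator is a function that receives a summary and one source document and returns whether the document is relevant; in practice such a function would wrap an LLM call with a chain-of-thought prompt. The names `Annotator`, `majority_vote`, and `cleanse`, and the choice to drop a document on a tie, are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Hypothetical annotator type: (summary, document) -> True if judged relevant.
# A real annotator would call an LLM; here it is any plain function.
Annotator = Callable[[str, str], bool]


def majority_vote(votes: Sequence[bool]) -> bool:
    """Return True only if strictly more than half of the votes are True.

    Ties (and an empty vote list) count as False, which is an assumption
    of this sketch rather than something stated in the paper.
    """
    return sum(1 for v in votes if v) > len(votes) / 2


def cleanse(summary: str, documents: List[str],
            annotators: Sequence[Annotator]) -> List[str]:
    """Keep only the documents that a majority of annotators judge relevant."""
    kept = []
    for doc in documents:
        votes = [annotate(summary, doc) for annotate in annotators]
        if majority_vote(votes):
            kept.append(doc)
    return kept
```

Using an odd number of annotators avoids ties, so the tie-handling rule never has to decide an outcome.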

Paper Structure (18 sections, 3 figures, 5 tables)

Introduction
Related Work
Multi-News+
Experiment
Experimental Design
Result
Discussion and Future Works
Conclusion
Dataset Statistics
Construction Process of Multi-News
Implementation Details
Manual Analysis
Additional Experiment with Large Language Models
Analysis of Multi-News
Examples of Noisy Documents
...and 3 more sections

Figures (3)

Figure 1: Overall framework for cleansing data and composing Multi-News+.
Figure 2: Histogram comparing the number of input articles in each dataset.
Figure 3: A screenshot of a webpage that is relevant to the article in Appendix \ref{sec:appendix-extreme}. Multi-News includes the text in the red box instead of the desired content in the blue box.
