A Review of the Challenges with Massive Web-mined Corpora Used in Large Language Models Pre-Training

Michał Perełkiewicz, Rafał Poświata

TL;DR

The paper addresses the challenges of using massive web-mined corpora for pre-training large language models. It surveys definitions, scale, and widely used corpora to ground the discussion, and analyzes data quality and noise, bias and representativeness, ethical considerations, duplication, low-resource languages, and benchmark data contamination, while discussing current mitigation approaches and gaps. The findings highlight substantial issues in quality, representation, and safety that can affect model performance and societal impact. The authors advocate for integrated data-cleaning pipelines, ethical data-use practices, bias-aware mitigation, and the development of robust benchmarks to enable safer, more reliable LLMs.

Abstract

This article presents a comprehensive review of the challenges associated with using massive web-mined corpora for the pre-training of large language models (LLMs). The review identifies key challenges in this domain, including noise (irrelevant or misleading information), duplication of content, the presence of low-quality or incorrect information, biases, and the inclusion of sensitive or personal information in web-mined corpora. Addressing these issues is crucial for the development of accurate, reliable, and ethically responsible language models. Through an examination of current methodologies for data cleaning, pre-processing, and bias detection and mitigation, we highlight the gaps in existing approaches and suggest directions for future research. Our discussion aims to catalyze advancements in developing more sophisticated and ethically responsible LLMs.

Paper Structure (13 sections)