Beyond Line-Level Filtering for the Pretraining Corpora of LLMs

Chanwoo Park, Suyoung Park, Yelim Ahn, Jongmin Kim, Jongyeon Park, Jaejin Lee

TL;DR

This work tackles the drawback of traditional line-level filtering in open web data for pretraining LLMs, where simple rules can discard content that improves downstream tasks. It introduces pattern-aware line filtering (PLD and PTF), which operate on document-level signal patterns to preserve structurally informative content while removing boilerplate. Across English and Korean 1B-scale decoders trained on CommonCrawl WET data, PLD and PTF yield improvements on multiple-choice benchmarks and generative QA benchmarks like SQuAD v1 and KorQuAD v1, though PTF can trade off generative QA gains for MC performance. The study provides a fully reproducible, language-tuned filtering pipeline, highlights cross-language differences, and suggests broader applicability to multilingual corpus construction and continual pretraining pipelines.

Abstract

Traditional line-level filtering techniques, such as line-level deduplication and trailing-punctuation filters, are commonly used, but these basic methods can discard valuable content and thereby hurt downstream performance. In this paper, we introduce two methods, pattern-aware line-level deduplication (PLD) and pattern-aware trailing punctuation filtering (PTF), that enhance these conventional filtering techniques. Our approach considers not only line-level signals but also their sequential distribution across documents, enabling us to retain structurally important content that might otherwise be removed. We evaluate the proposed methods by training small language models (1B parameters) in both English and Korean. The results demonstrate that our methods consistently improve performance on multiple-choice benchmarks and significantly enhance generative question-answering accuracy on both SQuAD v1 and KorQuAD v1.
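To make the drawback concrete, the following is a minimal sketch of the two conventional baselines the abstract names, not of PLD or PTF themselves, whose details are not given here. The function names, the punctuation set, and the `max_count` threshold are illustrative assumptions rather than the authors' settings.

```python
from collections import Counter

# Assumed set of terminal punctuation marks (hypothetical choice for illustration).
TERMINAL_PUNCTUATION = (".", "!", "?", '"')


def trailing_punctuation_filter(doc: str) -> str:
    """Baseline filter: keep only lines that end in a terminal punctuation mark."""
    kept = [
        line for line in doc.splitlines()
        if line.strip().endswith(TERMINAL_PUNCTUATION)
    ]
    return "\n".join(kept)


def line_level_dedup(docs: list[str], max_count: int = 1) -> list[str]:
    """Baseline dedup: drop every line that occurs more than max_count times corpus-wide.

    max_count is a hypothetical threshold; real pipelines choose their own policy.
    """
    counts = Counter(
        line.strip() for doc in docs for line in doc.splitlines() if line.strip()
    )
    result = []
    for doc in docs:
        kept = [
            line for line in doc.splitlines()
            if line.strip() and counts[line.strip()] <= max_count
        ]
        result.append("\n".join(kept))
    return result


docs = [
    "Home | About | Contact\nIngredients:\n2 cups flour\nMix well and bake.",
    "Home | About | Contact\nThe cat sat on the mat.",
]

# The repeated navigation line is removed from both documents.
deduped = line_level_dedup(docs)

# The ingredient list has no trailing punctuation, so the baseline filter drops it too.
filtered = trailing_punctuation_filter(deduped[0])
```

In this toy corpus the deduplication step removes the shared navigation bar as intended, but the punctuation filter also discards the ingredient list, which is the kind of structurally informative content the abstract says purely line-level rules can lose.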

Paper Structure

This paper contains 80 sections, 2 figures, 33 tables.

Figures (2)

  • Figure 1: Cumulative distribution of frequency count of text lines (English documents).
  • Figure 2: Cumulative distribution of frequency count of text lines (Korean documents).