Rethinking KenLM: Good and Bad Model Ensembles for Efficient Text Quality Filtering in Large Web Corpora

Yungi Kim; Hyunsoo Ha; Sukyung Lee; Jihoo Kim; Seonghoon Yang; Chanjun Park

TL;DR

An ensemble of two contrasting KenLMs, one trained on high-quality data (Good KenLM) and one trained on low-quality data (Bad KenLM), significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method. This suggests the method can be a practical solution with minimal computational overhead for resource-constrained environments.

Abstract

With the increasing demand for substantial amounts of high-quality data to train large language models (LLMs), efficiently filtering large web corpora has become a critical challenge. For this purpose, KenLM, a lightweight n-gram-based language model that operates on CPUs, is widely used. However, the traditional method of training KenLM utilizes only high-quality data and, consequently, does not explicitly learn the linguistic patterns of low-quality data. To address this issue, we propose an ensemble approach that leverages two contrasting KenLMs: (i) Good KenLM, trained on high-quality data; and (ii) Bad KenLM, trained on low-quality data. Experimental results demonstrate that our approach significantly reduces noisy content while preserving high-quality content compared to the traditional KenLM training method. This indicates that our method can be a practical solution with minimal computational overhead for resource-constrained environments.
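The abstract does not state how the two models' outputs are combined, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes each document already has a perplexity from each model, that the two perplexity lists are standardised with z-scores, and that a weight `alpha` (the figure list mentions an $\alpha$ parameter) trades off the two terms. A document scores well when Good KenLM finds it unsurprising (low perplexity) and Bad KenLM finds it surprising (high perplexity). The function names `zscores`, `ensemble_scores`, and `keep_indices`, the default `alpha`, and the keep-fraction cutoff are all hypothetical.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardise a list of perplexities to zero mean and unit variance."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # All values identical: no document stands out on this model.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def ensemble_scores(good_ppl, bad_ppl, alpha=0.7):
    """Combine per-document perplexities from the two models (assumed form).

    Lower score = better document: low perplexity under the Good model
    and high perplexity under the Bad model both push the score down.
    """
    if len(good_ppl) != len(bad_ppl):
        raise ValueError("need one perplexity per document from each model")
    z_good = zscores(good_ppl)
    z_bad = zscores(bad_ppl)
    return [alpha * g - (1 - alpha) * b for g, b in zip(z_good, z_bad)]


def keep_indices(scores, keep_fraction):
    """Return the indices of the lowest-scoring fraction of documents."""
    n_keep = int(len(scores) * keep_fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    return sorted(order[:n_keep])


if __name__ == "__main__":
    # Made-up perplexities for four documents, for illustration only.
    good = [100.0, 120.0, 110.0, 900.0]
    bad = [800.0, 700.0, 100.0, 150.0]
    print(keep_indices(ensemble_scores(good, bad, alpha=1.0), 0.5))
    print(keep_indices(ensemble_scores(good, bad, alpha=0.5), 0.5))
```

In the made-up data, the third document has a low perplexity under both models. Setting `alpha=1.0` reduces the score to the Good model alone, which keeps that document; giving the Bad model some weight drops it, which is the kind of case the ensemble is meant to catch.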

Paper Structure (19 sections, 1 equation, 2 figures, 3 tables)

Introduction
Related Work
    Perplexity-based filtering.
    Classifier-based filtering.
Proposed Method
    Good KenLM.
    Bad KenLM.
    Ensemble.
Experiments
    Experimental Settings
        Dataset and model details.
        Evaluation details.
    Main Results
        RQ1: Comparison of existing models.
        RQ2: Impact of data sources on training Bad KenLM.
...and 4 more sections

Figures (2)

Figure 1: The effect of $\alpha$ on the performance of our ensemble approach.
Figure 2: Visualization of examples that are not filtered by Good KenLM but are successfully removed by our ensemble approach.
