
LSHBloom: Memory-efficient, Extreme-scale Document Deduplication

Arham Khan, Robert Underwood, Carlo Siebenschuh, Yadu Babuji, Aswathy Ajith, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Kyle Chard, Ian Foster

TL;DR

This work tackles the challenge of memory- and compute-efficient deduplication for internet-scale text datasets used in LLM pre-training. It introduces LSHBloom, a Bloom-filter-based extension of MinHashLSH that replaces the heavy LSHIndex with lightweight Bloom filters, preserving deduplication quality while achieving order-of-magnitude gains in speed and space. Through thorough benchmarking against state-of-the-art methods, LSHBloom demonstrates comparable F1 to MinHashLSH with dramatically reduced disk usage (18x less) and significantly faster runtimes (up to 12x), enabling scalable deduplication to several billions of documents. The approach offers a practical, drop-in replacement for MinHashLSH that unlocks scalable, high-quality deduplication for massive web-scale text datasets, with future work focusing on parallelization to further accelerate processing.

Abstract

Contemporary large language model (LLM) training pipelines require the assembly of internet-scale databases full of text data from a variety of sources (e.g., web, academic, and publishers). Preprocessing these datasets via deduplication -- detecting and eliminating additional instances of the same content -- is a major focus for assembling and curating training datasets for LLMs. Left unchecked, duplicates in the training dataset increase training costs and lead to undesirable properties such as memorization in trained models or cheating on evaluation. Unfortunately, contemporary approaches to document-level deduplication are either unreliable at accurately identifying duplicate documents or extremely expensive in terms of both runtime and memory. We propose LSHBloom, an extension to MinHashLSH that replaces the expensive LSHIndex with lightweight Bloom filters. LSHBloom demonstrates the same state-of-the-art deduplication performance as MinHashLSH, with only a marginal increase in false positives (near zero in our experiments), while boasting competitive runtime (12$\times$ faster than MinHashLSH on peS2o) and, crucially, using 18$\times$ less disk space than MinHashLSH (as measured on peS2o). Based on extrapolation, we show that this advantage in space and runtime remains even at the extreme scale of several billion documents. LSHBloom allows practitioners to access the deduplication quality of MinHashLSH at scales that are normally only tractable for less sophisticated, heuristic solutions. As a result, LSHBloom promises to enable scaling high-quality document deduplication to internet-scale text datasets.
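To make the core idea concrete, the following is a minimal Python sketch of replacing an LSH index with one Bloom filter per band. It is an illustration under stated assumptions, not the authors' implementation: the class names (`BloomFilter`, `LSHBloomSketch`), the word-trigram shingling, the band and row counts, the filter size, and the choice to insert every document after querying are all hypothetical and chosen only for readability.

```python
import hashlib
import random

_PRIME = (1 << 61) - 1  # Mersenne prime for the (a*x + b) mod p hash family


def _h64(data: bytes) -> int:
    """Stable 64-bit hash of a byte string."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


class BloomFilter:
    """Plain Bloom filter; sizes here are illustrative, not tuned."""

    def __init__(self, num_bits: int = 1 << 16, num_hashes: int = 4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: bytes):
        # Double hashing: derive num_hashes bit positions from two base hashes.
        h1 = _h64(b"a" + item)
        h2 = _h64(b"b" + item) | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item: bytes) -> None:
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item: bytes) -> bool:
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in self._positions(item))


class LSHBloomSketch:
    """MinHash banding where each band is backed by a Bloom filter
    instead of a key-to-documents index (assumed layout)."""

    def __init__(self, bands: int = 8, rows: int = 4, seed: int = 0):
        rng = random.Random(seed)
        self.bands, self.rows = bands, rows
        self.perms = [(rng.randrange(1, _PRIME), rng.randrange(_PRIME))
                      for _ in range(bands * rows)]
        self.filters = [BloomFilter() for _ in range(bands)]

    def _signature(self, text: str):
        # Word trigrams as shingles (an assumption; other shinglings work too).
        words = text.lower().split()
        shingles = {" ".join(words[i:i + 3])
                    for i in range(max(1, len(words) - 2))}
        hashed = [_h64(s.encode()) % _PRIME for s in shingles]
        return [min((a * x + b) % _PRIME for x in hashed)
                for a, b in self.perms]

    def _band_keys(self, text: str):
        sig = self._signature(text)
        return [repr(sig[i * self.rows:(i + 1) * self.rows]).encode()
                for i in range(self.bands)]

    def check_and_insert(self, text: str) -> bool:
        """Return True if any band of this document was seen before,
        then record all of its bands."""
        keys = self._band_keys(text)
        is_dup = any(k in f for k, f in zip(keys, self.filters))
        for k, f in zip(keys, self.filters):
            f.add(k)
        return is_dup


if __name__ == "__main__":
    index = LSHBloomSketch()
    for doc in ["a short example document about deduplication at scale",
                "a short example document about deduplication at scale",
                "an entirely different text on another topic altogether"]:
        print(index.check_and_insert(doc), doc)
```

Because each filter stores only bits rather than band keys and document identifiers, the sketch can answer "has a matching band been seen?" but cannot report which earlier document matched. Any Bloom filter false positive in any band shows up as an extra false duplicate, which is the trade-off the abstract describes as a marginal increase in false positives.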

Paper Structure

This paper contains 31 sections, 6 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Breakdown of wall clock time on 10% of peS2o for the conventional MinHashLSH algorithm and our LSHBloom method
  • Figure 2: Value of $p_{effective}$ (the effective false positive rate overhead) in our correctness benchmark at various parameter settings given $N$=24,956 documents and $p$=1e-5
  • Figure 3: F1 score for LSH techniques as a function of the number of permutations (x-axis) and Jaccard similarity threshold (y-axis).
  • Figure 4: F1 score for N-Gram techniques as a function of N-Gram size (x-axis) and overlap threshold (y-axis).
  • Figure 5: F1 score vs. threshold for paragraph-level techniques.
  • ...and 5 more figures