FED: Fast and Efficient Dataset Deduplication Framework with GPU Acceleration

Youngjun Son, Chaewon Kim, Jaejin Lee

TL;DR

FED targets a costly bottleneck in preparing training data for large language models: dataset deduplication. It is a GPU-accelerated MinHash LSH framework built on computationally efficient, partially reusable non-cryptographic hash functions. By optimizing GPU kernels, I/O parallelism, and a dense pairwise comparison strategy, FED runs up to 107.2× faster than the CPU-based SlimPajama tool and up to 6.3× faster than the GPU-based NVIDIA NeMo Curator tool when processing 30 million documents on a four-GPU node. Deduplication quality is preserved: the duplicate document sets it finds reach a Jaccard similarity above 0.96 with those found by the standard MinHash algorithm. At scale, FED deduplicates 1.2 trillion tokens in 6 hours on four nodes with 16 GPUs, which makes preprocessing of trillion-token training corpora practical. Its design combines partially reusable hashing, efficient shingling, and a matrix-multiplication-style similarity kernel to maximize GPU utilization, and the code is publicly available.

Abstract

Dataset deduplication plays a crucial role in enhancing data quality, ultimately improving the training performance and efficiency of large language models. A commonly used method for data deduplication is the MinHash LSH algorithm. Recently, NVIDIA introduced a GPU-based MinHash LSH deduplication method, but it remains suboptimal, leaving room for further improvement in processing efficiency. This paper proposes a GPU-accelerated deduplication framework, FED, that optimizes MinHash LSH for GPU clusters and leverages computationally efficient, partially reusable non-cryptographic hash functions. When processing 30 million documents on a node with four GPUs, FED outperforms the CPU-based deduplication tool in SlimPajama (using 64 logical CPU cores) by up to 107.2 times and the GPU-based tool in NVIDIA NeMo Curator by up to 6.3 times. Notably, our method dramatically accelerates the previously time-consuming MinHash signature generation phase, achieving speed-ups of up to 260 times compared to the CPU baseline. Despite these gains in efficiency, FED maintains high deduplication quality, with the duplicate document sets reaching a Jaccard similarity of over 0.96 compared to those identified by the standard MinHash algorithm. In large-scale experiments, the deduplication of 1.2 trillion tokens is completed in just 6 hours in a four-node, 16-GPU environment. The related code is publicly available on GitHub (https://github.com/mcrl/FED).
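
For readers unfamiliar with MinHash LSH, the following is a minimal CPU-only sketch of the general technique the abstract refers to, not FED's GPU implementation. The function names, the character 5-gram shingling, the BLAKE2b base hash, and the affine remixing `(a * h + b) mod 2^64` are all illustrative assumptions. Hashing each shingle once and deriving the other hash values cheaply from it is a common trick, but it is not claimed to match FED's partially reusable hash functions.

```python
import hashlib
import random

MASK64 = (1 << 64) - 1  # keep all arithmetic in 64 bits

def shingles(text, k=5):
    """Return the set of character k-grams of a document (assumed shingling)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def base_hash(shingle):
    """Deterministic 64-bit hash of one shingle (a stand-in, not FED's hash)."""
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "little")

def make_params(num_hashes, seed=0):
    """Draw (a, b) pairs; odd `a` makes h -> a*h + b a bijection mod 2^64."""
    rng = random.Random(seed)
    return [(rng.getrandbits(64) | 1, rng.getrandbits(64))
            for _ in range(num_hashes)]

def minhash_signature(text, params, k=5):
    """One minimum per hash function, computed over all shingles."""
    base = [base_hash(s) for s in shingles(text, k)]  # hashed once, then reused
    return [min((a * h + b) & MASK64 for h in base) for a, b in params]

def band_buckets(signature, num_bands):
    """Split the signature into bands; each band yields one bucket key."""
    rows = len(signature) // num_bands
    return [(band, tuple(signature[band * rows:(band + 1) * rows]))
            for band in range(num_bands)]

def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions estimates the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two documents become a candidate pair when they share at least one bucket key; only candidate pairs need a full similarity check, which is what keeps the method tractable on large corpora.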

Paper Structure

This paper contains 39 sections, 6 equations, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: The process of MinHash generation.
  • Figure 2: Examples of duplicate documents.
  • Figure 3: Hashing by MinHash LSH.
  • Figure 4: Generating the signature matrix and calculating the bucket ID for each band.
  • Figure 5: Pairwise comparison and the union graph generation.
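
The steps named in Figures 4 and 5 (bucketing by band, pairwise comparison, and grouping through a union graph) can be sketched on the CPU as follows. This is an illustrative assumption of how such a pipeline is commonly written, not FED's code: FED performs the comparison on GPUs with a matrix-multiplication-style kernel. The names `UnionFind` and `duplicate_groups`, the 0.8 default threshold, and comparing signatures rather than the original shingle sets are all hypothetical choices.

```python
from collections import defaultdict
from itertools import combinations

class UnionFind:
    """Disjoint-set structure; connected components are duplicate groups."""
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, x, y):
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[max(rx, ry)] = min(rx, ry)  # smallest id is the root

def duplicate_groups(signatures, num_bands, threshold=0.8):
    """Group documents whose signatures agree on >= threshold of positions."""
    if not signatures:
        return []
    rows = len(signatures[0]) // num_bands
    # Bucket documents by (band index, band contents).
    buckets = defaultdict(list)
    for doc_id, sig in enumerate(signatures):
        for band in range(num_bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(doc_id)
    # Compare every pair that shares a bucket and link the similar ones.
    uf = UnionFind(len(signatures))
    for docs in buckets.values():
        for i, j in combinations(docs, 2):
            matches = sum(x == y for x, y in zip(signatures[i], signatures[j]))
            if matches / len(signatures[i]) >= threshold:
                uf.union(i, j)
    # Collect connected components with more than one member.
    groups = defaultdict(list)
    for doc_id in range(len(signatures)):
        groups[uf.find(doc_id)].append(doc_id)
    return [g for g in groups.values() if len(g) > 1]
```

A deduplication pass would then keep one document per returned group and drop the rest; which member to keep is a policy decision outside this sketch.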