Vectorized Sequence-Based Chunking for Data Deduplication

Sreeharsha Udayashankar, Samer Al-Kiswany

TL;DR

SeqCDC addresses the throughput bottleneck of content-defined chunking (CDC) with three techniques: lightweight boundary detection based on monotonically increasing or decreasing byte sequences, content-based data skipping that avoids scanning regions unlikely to contain a boundary, and SIMD acceleration (SSE/AVX) of both boundary detection and skipping. The approach targets large chunk sizes, which deduplication systems commonly prefer because they reduce fingerprinting overhead, and it keeps space savings comparable to existing CDC methods. The evaluation reports 15× higher throughput than unaccelerated CDC algorithms and 1.2×–1.35× higher throughput than vector-accelerated ones, while remaining compatible with AVX-256 and SSE-128 on common CPUs. The results indicate that combining monotonic-sequence boundary detection, controlled data skipping, and vectorization yields substantial throughput gains with minimal effect on deduplication effectiveness.

Abstract

Data deduplication has gained wide acclaim as a mechanism to improve storage efficiency and conserve network bandwidth. Its most critical phase, data chunking, is responsible for the overall space savings achieved via the deduplication process. However, modern data chunking algorithms are slow and compute-intensive because they scan large amounts of data while simultaneously making data-driven boundary decisions. We present SeqCDC, a novel chunking algorithm that leverages lightweight boundary detection, content-defined skipping, and SSE/AVX acceleration to improve chunking throughput for large chunk sizes. Our evaluation shows that SeqCDC achieves 15x higher throughput than unaccelerated and 1.2x-1.35x higher throughput than vector-accelerated data chunking algorithms while minimally affecting deduplication space savings.
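To make the two core ideas in the abstract concrete, the following is a minimal, unvectorized sketch of sequence-based boundary detection with content-defined skipping. It is an illustration under stated assumptions, not the authors' implementation: the function name `seq_chunk_boundaries`, the parameters `min_size`, `max_size`, `seq_length`, `skip_trigger`, and `skip_size`, their default values, and the choice of increasing (rather than decreasing) sequences are all hypothetical choices made for this example.

```python
def seq_chunk_boundaries(data: bytes, min_size: int = 64, max_size: int = 1024,
                         seq_length: int = 3, skip_trigger: int = 8,
                         skip_size: int = 16) -> list[int]:
    """Return the exclusive end offset of every chunk in `data`.

    Assumed rules (illustrative only, not taken from the paper):
      * a boundary is declared after `seq_length` strictly increasing bytes;
      * after `skip_trigger` decreasing byte pairs, `skip_size` bytes are
        skipped without being inspected;
      * no boundary is searched for inside the first `min_size` bytes, and a
        chunk is cut unconditionally at `max_size` bytes.
    """
    boundaries = []
    start, n = 0, len(data)
    while start < n:
        end = min(start + max_size, n)   # hard limit for this chunk
        i = start + max(min_size, 1)     # jump over the minimum-size region
        run = 0                          # consecutive increasing byte pairs
        opposing = 0                     # decreasing byte pairs seen so far
        cut = end                        # fallback: cut at max size / end of data
        while i < end:
            if data[i] > data[i - 1]:
                run += 1
                # seq_length increasing bytes == seq_length - 1 increasing pairs
                if run >= seq_length - 1:
                    cut = i + 1
                    break
            else:
                run = 0
                if data[i] < data[i - 1]:
                    opposing += 1
                    if opposing >= skip_trigger:
                        # Content-defined skip: this region looks unlikely to
                        # contain an increasing sequence, so jump ahead.
                        i += skip_size
                        opposing = 0
                        continue
            i += 1
        boundaries.append(cut)
        start = cut
    return boundaries


# Small hand-checkable input: the increasing sequence 1, 2, 3 ends the first chunk.
sample = bytes([5] * 10 + [1, 2, 3] + [5] * 10)
sample_cuts = seq_chunk_boundaries(sample, min_size=2, max_size=100,
                                   seq_length=3, skip_trigger=100, skip_size=4)
```

Because each decision uses only comparisons between adjacent bytes, the inner loop has no rolling hash to maintain, which is also what makes this style of detection a natural fit for the SSE/AVX comparison instructions the abstract mentions. The sketch above processes one byte at a time and does not attempt to model that vectorized path.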

Paper Structure

This paper contains 20 sections, 1 equation, 12 figures, 3 tables.

Figures (12)

  • Figure 1: Chunking throughput on randomized data
  • Figure 2: An example of a chunk generated by SeqCDC
  • Figure 3: Accelerating SeqCDC with AVX-512 instructions
  • Figure 4: Handling byte-shifts with SeqCDC
  • Figure 5: Space Savings with 8KB chunks. Note that SEQ is unaccelerated SeqCDC.
  • ...and 7 more figures