A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

Marcel Gregoriadis; Leonhard Balduf; Björn Scheuermann; Johan Pouwelse

TL;DR

This paper presents a rigorous theoretical analysis and an impartial experimental comparison of several leading CDC algorithms on four key metrics (throughput, deduplication ratio, average chunk size, and chunk-size variance). The results inform the selection and optimization of CDC algorithms for practical data deduplication.

Abstract

Data deduplication emerged as a powerful solution for reducing storage and bandwidth costs in cloud settings by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite advancements, the current state-of-the-art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of their original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.
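
To make the four metrics concrete, the following is a minimal sketch of how they might be computed once a file has been split into chunks. It is not the authors' evaluation code. The function name `chunk_metrics`, the use of SHA-256 fingerprints, and the definition of the deduplication ratio as the fraction of bytes eliminated are all assumptions made for illustration; other definitions of the ratio (for example, total bytes divided by unique bytes) are also in use.

```python
import hashlib
from statistics import mean, pvariance


def chunk_metrics(chunks, elapsed_seconds):
    """Compute four chunking metrics for a list of byte-string chunks.

    Hypothetical helper for illustration only. `elapsed_seconds` is the
    time the chunker took, measured by the caller.
    """
    sizes = [len(c) for c in chunks]
    total_bytes = sum(sizes)

    # Store each distinct chunk once, identified by its fingerprint.
    unique = {}
    for c in chunks:
        unique.setdefault(hashlib.sha256(c).digest(), len(c))
    unique_bytes = sum(unique.values())

    return {
        "throughput_bytes_per_s": total_bytes / elapsed_seconds,
        # Assumed definition: fraction of bytes saved by deduplication.
        "dedup_ratio": 1 - unique_bytes / total_bytes,
        "avg_chunk_size": mean(sizes),
        # Population variance of the chunk sizes.
        "chunk_size_variance": pvariance(sizes),
    }


if __name__ == "__main__":
    demo = [b"aaaa", b"bb", b"aaaa", b"cc"]
    print(chunk_metrics(demo, elapsed_seconds=2.0))
```

In this toy input the chunk `b"aaaa"` appears twice, so only 8 of the 12 bytes need to be stored.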

Paper Structure (51 sections, 3 equations, 11 figures, 10 tables, 7 algorithms)

Introduction
Background
Inception of CDC
Chunk-Size Variance
Modern CDC Algorithms
Related Work
Chunking Algorithms
Basic Sliding Window (BSW)
Asymmetric Extremum (AE)
Rapid Asymmetric Maximum (RAM)
Minimal Incremental Interval (MII)
Parity Check of Interval (PCI)
Bytes-Frequency-Based Chunking (BFBC)
Determining BFBC* Divisors
Experiment Setup
...and 36 more sections

Figures (11)

Figure 1: Boundary-shift problem.
Figure 2: Window sliding over a file byte-by-byte.
Figure 3: Taxonomy of evaluated chunking algorithms.
Figure 4: Illustration of the AE chunking algorithm on a sequence of 27 bytes and a horizon $h=4$. The vertical red lines mark the cut points which then determine the resulting chunks $c_i$.
Figure 5: CDC throughput, median values and quartiles, $\mu = 2\,\mathrm{KiB}$, RAND dataset. BFBC-L indicates BFBC on the CODE dataset.
...and 6 more figures
