A Thorough Investigation of Content-Defined Chunking Algorithms for Data Deduplication

Marcel Gregoriadis, Leonhard Balduf, Björn Scheuermann, Johan Pouwelse

TL;DR

A rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms against four key metrics, providing valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.

Abstract

Data deduplication has emerged as a powerful solution for reducing storage and bandwidth costs in cloud settings by eliminating redundancies at the level of chunks. This has spurred the development of numerous Content-Defined Chunking (CDC) algorithms over the past two decades. Despite these advancements, the current state of the art remains obscure, as a thorough and impartial analysis and comparison is lacking. We conduct a rigorous theoretical analysis and impartial experimental comparison of several leading CDC algorithms. Using four realistic datasets, we evaluate these algorithms against four key metrics: throughput, deduplication ratio, average chunk size, and chunk-size variance. Our analyses, in many instances, extend the findings of the original publications by reporting new results and putting existing ones into context. Moreover, we highlight limitations that have previously gone unnoticed. Our findings provide valuable insights that inform the selection and optimization of CDC algorithms for practical applications in data deduplication.

Paper Structure

This paper contains 51 sections, 3 equations, 11 figures, 10 tables, and 7 algorithms.

Figures (11)

  • Figure 1: Boundary-shift problem.
  • Figure 2: Window sliding over a file byte-by-byte.
  • Figure 3: Taxonomy of evaluated chunking algorithms.
  • Figure 4: Illustration of the AE chunking algorithm on a sequence of 27 bytes and a horizon $h=4$. The vertical red lines mark the cut points which then determine the resulting chunks $c_i$.
  • Figure 5: CDC throughput, median values and quartiles, $\mu = 2\,\mathrm{KiB}$, RAND dataset. BFBC-L indicates BFBC on the CODE dataset.
  • ...and 6 more figures