Algorithms for Massive Data -- Lecture Notes

Nicola Prezza

TL;DR

The notes survey core techniques for processing data that far exceeds memory, distinguishing lossless compressed data structures from lossy sketches. They detail compressed text indexes (suffix arrays, suffix trees, CSA, FM-index) built around entropy concepts and the Burrows-Wheeler transform, achieving near-optimal space and efficient pattern queries. They then cover probabilistic tools, hashing, and probabilistic filters (Bloom, counting Bloom, quotient filters) to enable compact membership and similarity queries, culminating in similarity-preserving sketches such as Rabin hashing and MinHash. Together, these methods underpin scalable search, retrieval, and analytics on massive data, with concrete space bounds and query-time guarantees. The practical impact spans information retrieval, computational biology, and large-scale data processing where exact storage is infeasible but fast, approximate answers are sufficient.

Abstract

These are the lecture notes for the course CM0622 - Algorithms for Massive Data, Ca' Foscari University of Venice. The goal of this course is to introduce algorithmic techniques for dealing with massive data: data so large that it does not fit in the computer's memory. There are two main solutions to deal with massive data: (lossless) compressed data structures and (lossy) data sketches. These notes cover both topics: compressed suffix arrays, probabilistic filters, sketching under various metrics, Locality Sensitive Hashing, nearest neighbour search, algorithms on streams.

Paper Structure

This paper contains 127 sections, 96 theorems, 247 equations, 31 figures, 2 tables, 10 algorithms.

Key Result

Theorem 1.1.5

For a text of length $n$ and a pattern of length $m$ with $occ$ occurrences, the suffix trie uses $O(n^2)$ words of space, supports count queries in optimal time $O(m)$, and supports locate queries in optimal time $O(m+occ)$.
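
The theorem can be illustrated with a minimal Python sketch. It is not the notes' own implementation: the names `Node`, `build_suffix_trie`, `count`, and `locate` are hypothetical, and the left-to-right linked list of leaves described in Figure 1 is replaced here by a sorted leaf array, with each node storing the index range of its leaves. Positions are 1-based, as in the figure, and the text is assumed to end with a unique terminator `$`.

```python
class Node:
    """Suffix trie node (hypothetical layout, for illustration only)."""
    __slots__ = ("children", "count", "lo", "hi", "leaf")

    def __init__(self):
        self.children = {}   # character -> child Node
        self.count = 0       # number of suffixes passing through this node
        self.lo = -1         # index of the leftmost leaf in the leaf array
        self.hi = -1         # index of the rightmost leaf in the leaf array
        self.leaf = None     # 1-based suffix start, set on leaves only


def build_suffix_trie(text):
    """Insert every suffix character by character: O(n^2) nodes overall."""
    assert text.endswith("$") and text.count("$") == 1
    root = Node()
    for start in range(len(text)):
        node = root
        node.count += 1
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.count += 1
        node.leaf = start + 1

    leaves = []  # leaves in lexicographic order of their suffixes

    def dfs(node):
        # Recursive for brevity; deep tries would need an explicit stack.
        if node.leaf is not None:
            node.lo = node.hi = len(leaves)
            leaves.append(node.leaf)
            return
        node.lo = len(leaves)
        for ch in sorted(node.children):
            dfs(node.children[ch])
        node.hi = len(leaves) - 1

    dfs(root)
    return root, leaves


def _descend(root, pattern):
    """Follow the pattern from the root in O(m) steps; None if it is absent."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def count(root, pattern):
    node = _descend(root, pattern)
    return 0 if node is None else node.count


def locate(root, leaves, pattern):
    node = _descend(root, pattern)
    return [] if node is None else leaves[node.lo:node.hi + 1]


root, leaves = build_suffix_trie("abaab$")
print(leaves)                     # [6, 3, 4, 1, 5, 2]
print(count(root, "a"))           # 3
print(locate(root, leaves, "a"))  # [3, 4, 1]
```

Both queries first descend along the pattern in $O(m)$ steps. `count` then reads a stored value, while `locate` copies one contiguous slice of length $occ$, which matches the $O(m+occ)$ bound.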

Figures (31)

  • Figure 1: Suffix trie of the string $\mathcal{T}=abaab\$$. Leaves are labeled with the starting position in $\mathcal{T}$ of the corresponding suffix of $\mathcal{T}$. In addition to the tree's edges, we also store additional information: (1) on each node $x$, we store two pointers to the leftmost and rightmost leaf in the subtree rooted in $x$ (for clarity, in the example we show these pointers --- in dashed green --- only on one node), (2) on each node $x$ such that the string read from the root to $x$ is $s$, we store $\mathtt{count}(s)$ (we do not show this information in the figure; for example, on the node reached by reading string $"a"$ from the root, this value would be $\mathtt{count}("a")=3$), and (3) we link the leaves from left to right using a linked list (shown in dashed red in the figure): $6 \rightarrow 3 \rightarrow 4 \rightarrow 1 \rightarrow 5 \rightarrow 2$. Pattern matching example: to find all occurrences of the string $"a"$, descend from the root reading $"a"$, use the extra (dashed green) pointers to jump to the leftmost (3) and rightmost (1) leaves in the subtree of the node, and starting from the leftmost leaf (3) follow the linked list of leaves until reaching the rightmost leaf (1). Proceeding in this way, we navigate the sub-list $3\rightarrow 4 \rightarrow 1$, corresponding to all occurrences of $"a"$.
  • Figure 2: Suffix tree of the string $\mathcal{T}=abaab\$$. For each node, we store the same extra information shown in Figure 1 (not shown here for simplicity).
  • Figure 3: Table $T$ of Example \ref{ex:tableT}. Classes correspond to rows and offsets to columns.
  • Figure 4: The suffix array and sorted suffixes of $S = BANANA\$$.
  • Figure 5: The sorted suffixes of $S = BANANA\$$. Let's store just the first character (in black) of each suffix, in a string $F = \$AAABNN$.
  • ...and 26 more figures
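
The objects in Figures 4 and 5 can be reproduced with a short Python sketch. It assumes a naive construction that sorts suffix start positions by comparing the suffixes directly, which is fine for a toy string but is not an efficient suffix-array algorithm. The function name `suffix_array` is hypothetical, and positions are 0-based here.

```python
def suffix_array(s):
    """Start positions of the suffixes of s, sorted lexicographically.

    Naive sketch: each comparison may scan a whole suffix, so this is
    only suitable for short example strings.
    """
    return sorted(range(len(s)), key=lambda i: s[i:])


S = "BANANA$"
sa = suffix_array(S)

# F keeps just the first character of each sorted suffix (Figure 5).
F = "".join(S[i] for i in sa)

for i in sa:
    print(i, S[i:])
print(F)  # $AAABNN
```

In ASCII, `$` sorts before every uppercase letter, so the terminator suffix comes first, as in the figures.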

Theorems & Definitions (240)

  • Definition 1.1.1: Full-text indexing
  • Definition 1.1.2
  • Example 1.1.3
  • Definition 1.1.4
  • Theorem 1.1.5
  • Theorem 1.1.6
  • Theorem 1.1.7
  • Definition 1.2.1: Worst-case entropy
  • Corollary 1.2.2
  • Example 1.2.3
  • ...and 230 more