Algorithms for Massive Data -- Lecture Notes
Nicola Prezza
TL;DR
The notes survey core techniques for processing data too large to fit in memory, distinguishing lossless compressed data structures from lossy sketches. They detail classical text indexes (suffix arrays, suffix trees) and their compressed counterparts (compressed suffix arrays, the FM-index), built around entropy and the Burrows-Wheeler transform, which achieve near-optimal space while supporting efficient pattern queries. They then cover probabilistic tools, hashing, and probabilistic filters (Bloom, counting Bloom, and quotient filters) for compact membership queries, and conclude with similarity-preserving sketches such as Rabin hashing and MinHash. Together, these methods underpin scalable search, retrieval, and analytics on massive data, with concrete space bounds and query-time guarantees. Applications span information retrieval, computational biology, and large-scale data processing, where storing the data uncompressed is infeasible and, for sketches, fast approximate answers are sufficient.
Abstract
These are the lecture notes for the course CM0622 - Algorithms for Massive Data, Ca' Foscari University of Venice. The goal of the course is to introduce algorithmic techniques for dealing with massive data: data so large that it does not fit in the computer's memory. There are two main approaches to this problem: (lossless) compressed data structures and (lossy) data sketches. These notes cover both, including compressed suffix arrays, probabilistic filters, sketching under various metrics, Locality Sensitive Hashing, nearest neighbour search, and algorithms on streams.
