Random Permutation Codes: Lossless Source Coding of Non-Sequential Data

Daniel Severo

TL;DR

This work gives a formal definition for non-sequential objects as random sets of equivalent sequences, which it refers to as Combinatorial Random Variables (CRVs), and establishes the non-sequential data type represented by the CRV.

Abstract

This thesis deals with the problem of communicating and storing non-sequential data. We investigate this problem through the lens of lossless source coding, also sometimes referred to as lossless compression, from both an algorithmic and information-theoretic perspective. Lossless compression algorithms typically preserve the ordering in which data points are compressed. However, there are data types where order is not meaningful, such as collections of files, rows in a database, nodes in a graph, and, notably, datasets in machine learning applications. Compressing with traditional algorithms is possible if we pick an order for the elements and communicate the corresponding ordered sequence. However, unless the order information is somehow removed during the encoding process, this procedure will be sub-optimal, because the order contains information and therefore more bits are used to represent the source than are truly necessary. In this work we give a formal definition for non-sequential objects as random sets of equivalent sequences, which we refer to as Combinatorial Random Variables (CRVs). The definition of equivalence, formalized as an equivalence relation, establishes the non-sequential data type represented by the CRV. The achievable rates of CRVs are fully characterized as a function of the equivalence relation as well as the data distribution. The optimal rates of CRVs are achieved within the family of Random Permutation Codes (RPCs) developed in later chapters. RPCs randomly select one of the many possible sequences that can represent the instance of the CRV. Specialized RPCs are given for the case of multisets, graphs, and partitions/clusterings, providing new algorithms for compression of databases, social networks, and web data in the JSON file format.
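As a rough illustration of why the order "contains information", consider the simplest case, which is an assumption made here rather than a setting taken from the thesis: a set of $n$ distinct elements whose $n!$ orderings are all equally likely. Communicating one particular ordering then spends $\log_2 n!$ bits on information the receiver does not need. The helper name `order_bits` below is hypothetical.

```python
import math

def order_bits(n: int) -> float:
    """Bits needed to single out one of the n! orderings of n distinct
    elements, assuming all orderings are equally likely (illustrative
    assumption, not the thesis's general setting)."""
    # math.lgamma(n + 1) equals ln(n!); dividing by ln(2) converts to bits.
    return math.lgamma(n + 1) / math.log(2)

if __name__ == "__main__":
    for n in (10, 1_000, 1_000_000):
        total = order_bits(n)
        print(f"n={n}: about {total:.0f} bits in total, "
              f"{total / n:.2f} bits per element")
```

The per-element cost grows roughly like $\log_2 n$, so the potential saving from discarding the order becomes more significant as the collection grows. For multisets with repeated elements the number of distinct orderings is smaller, so this figure is an upper bound in that case.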

Paper Structure

This paper contains 76 sections, 20 theorems, 210 equations, 14 figures, 5 tables, 8 algorithms.

Key Result

Theorem 1.1.14

The set of code-lengths $\{\ell_x \in \mathbb{N} \colon x \in \mathcal{X}\}$ of a prefix-free symbol code obeys the following inequality, known as Kraft's Inequality (stated here for a binary code alphabet),

$$\sum_{x \in \mathcal{X}} 2^{-\ell_x} \leq 1.$$

Conversely, for any set of code-lengths satisfying Kraft's Inequality, there exists a prefix-free symbol code with this set of code-lengths.
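Both directions of the theorem can be sketched in a few lines, assuming a binary code alphabet so that Kraft's Inequality reads $\sum_{x} 2^{-\ell_x} \leq 1$. The function names are hypothetical, and the construction used for the converse is the standard canonical-code assignment, which is not necessarily the one used in the thesis's proof.

```python
from fractions import Fraction

def kraft_sum(lengths):
    """Exact value of sum(2**-l) over the given code-lengths."""
    return sum(Fraction(1, 2 ** l) for l in lengths)

def is_prefix_free(codebook):
    """True if no codeword is a prefix of another codeword."""
    words = sorted(set(codebook))
    # In sorted order, a word that is a prefix of any other word is
    # immediately followed by a word that extends it.
    return not any(b.startswith(a) for a, b in zip(words, words[1:]))

def code_from_lengths(lengths):
    """Build a binary prefix-free code for lengths satisfying Kraft's
    Inequality. Codewords are returned in order of increasing length."""
    if any(l < 1 for l in lengths):
        raise ValueError("code-lengths must be positive integers")
    if kraft_sum(lengths) > 1:
        raise ValueError("lengths violate Kraft's Inequality")
    codewords, value, prev_len = [], 0, 0
    for l in sorted(lengths):
        value <<= l - prev_len          # pad the counter to the new length
        codewords.append(format(value, f"0{l}b"))
        value += 1                      # next unused codeword at this length
        prev_len = l
    return codewords

if __name__ == "__main__":
    figure_2_codebook = {"1", "00", "011"}   # codebook from the Figure 2 caption
    print(is_prefix_free(figure_2_codebook))
    print(kraft_sum(len(w) for w in figure_2_codebook))
    print(code_from_lengths([1, 2, 3]))
```

For the codebook $\{1, 00, 011\}$ the Kraft sum is $1/2 + 1/4 + 1/8 = 7/8$, strictly below 1, which indicates that one codeword could be shortened without losing the prefix-free property.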

Figures (14)

  • Figure 1: Binary trees with binary strings (left) and $2^{-\ell}$ (right) as nodes. Codewords in bold form a prefix-free code for an alphabet of size $4$.
  • Figure 2: Binary tree for a prefix-free codebook $\{1, 00, 011\}$ (left). Optimal codebook for $\lvert\mathcal{X}\rvert=3$ (right).
  • Figure 3: Percentage increase, with respect to the optimal, from using an extended uniform code. See the example on the optimal rate for a uniform source.
  • Figure 4: ANS state change under BB-ANS.
  • Figure 5: A non-simple directed graph (left) and simple undirected graph (right).
  • ...and 9 more figures

Theorems & Definitions (125)

  • Definition 1.1.1: Lossless Source Code
  • Definition 1.1.2: Rate
  • Definition 1.1.3: Asymptotic Rate
  • Definition 1.1.4: Optimal Code
  • Example 1.1.5: Optimal Codes
  • Definition 1.1.6: Sequential Code
  • Example 1.1.7: Optimal Sequential Codes
  • Example 1.1.8: Encode and Decode functions
  • Definition 1.1.9: Extended Codes
  • Example 1.1.10: Fixed-length Codes
  • ...and 115 more