Random Permutation Codes: Lossless Source Coding of Non-Sequential Data

Daniel Severo

TL;DR

This work formally defines non-sequential objects as random sets of equivalent sequences, called Combinatorial Random Variables (CRVs). The equivalence relation used in the definition determines the non-sequential data type that a CRV represents.

Abstract

This thesis deals with the problem of communicating and storing non-sequential data. We investigate this problem through the lens of lossless source coding, also sometimes referred to as lossless compression, from both an algorithmic and an information-theoretic perspective. Lossless compression algorithms typically preserve the ordering in which data points are compressed. However, there are data types where order is not meaningful, such as collections of files, rows in a database, nodes in a graph, and, notably, datasets in machine learning applications. Compressing with traditional algorithms is possible if we pick an order for the elements and communicate the corresponding ordered sequence. However, unless the order information is somehow removed during the encoding process, this procedure will be sub-optimal, because the order contains information and therefore more bits are used to represent the source than are truly necessary. In this work we give a formal definition for non-sequential objects as random sets of equivalent sequences, which we refer to as Combinatorial Random Variables (CRVs). The definition of equivalence, formalized as an equivalence relation, establishes the non-sequential data type represented by the CRV. The achievable rates of CRVs are fully characterized as a function of the equivalence relation as well as the data distribution. The optimal rates of CRVs are achieved within the family of Random Permutation Codes (RPCs) developed in later chapters. RPCs randomly select one of the many possible sequences that can represent the instance of the CRV. Specialized RPCs are given for the case of multisets, graphs, and partitions/clusterings, providing new algorithms for compression of databases, social networks, and web data in the JSON file format.
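To give a rough sense of how much information the order can carry, the sketch below counts the distinct orderings of a multiset and reports the base-2 logarithm of that count. It is an illustration only, not the thesis's algorithm: the function name `order_bits` is hypothetical, and the interpretation as "bits wasted" assumes every distinct ordering of the multiset is equally likely (as it would be for an exchangeable source).

```python
import math
from collections import Counter


def order_bits(items):
    """Return log2 of the number of distinct orderings of a multiset.

    Assumption (not from the source text): all distinct orderings are
    equally likely, so this is the number of bits spent on the order
    when one ordering is picked and encoded as a sequence.
    """
    counts = Counter(items)
    n = sum(counts.values())
    # Distinct orderings of a multiset: n! / (m_1! * m_2! * ... * m_k!)
    num_orderings = math.factorial(n)
    for m in counts.values():
        num_orderings //= math.factorial(m)
    return math.log2(num_orderings)


if __name__ == "__main__":
    # A set of 1000 distinct elements has 1000! possible orderings.
    print(f"{order_bits(range(1000)):.1f} bits carried by the order")
```

When all elements are identical there is only one ordering and the function returns 0, which matches the intuition that there is no order information to remove in that case.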
