Data Compression with Relative Entropy Coding

Gergely Flamich

TL;DR

The thesis studies Data Compression with Relative Entropy Coding, a framework that generalizes classical source coding to uncertain, randomized information and continuous spaces, enabling privacy and perceptual considerations in ML-based compression. It develops tight fundamental limits via the channel simulation divergence, and introduces fast, Poisson-process–based samplers (rejection sampling, A* sampling, Greedy Poisson Rejection Sampling) that approach these limits while enabling parallelism and approximate schemes. It also introduces COMBINER, a practically efficient compression scheme using Bayesian implicit neural representations, demonstrating high performance on image, video, audio, and protein data with low energy and small models. The work bridges theory and practice by providing constructive algorithms with provable guarantees, practical implementation strategies, and clear guidelines for deployment in real-world ML pipelines. Overall, the combination of rigorous limits, fast sampler constructions, and energy-efficient compression architectures positions relative entropy coding as a viable, scalable alternative to conventional quantization-based methods for modern data compression tasks.

Abstract

Over the last few years, machine learning unlocked previously infeasible features for compression, such as providing guarantees for users' privacy or tailoring compression to specific data statistics (e.g., satellite images or audio recordings of animals) or users' audiovisual perception. This, in turn, has led to an explosion of theoretical investigations and insights that aim to develop new fundamental theories, methods and algorithms better suited for machine learning-based compressors. In this thesis, I contribute to this trend by investigating relative entropy coding, a mathematical framework that generalises classical source coding theory. Concretely, relative entropy coding deals with the efficient communication of uncertain or randomised information. One of its key advantages is that it extends compression methods to continuous spaces and can thus be integrated more seamlessly into modern machine learning pipelines than classical quantisation-based approaches. Furthermore, it is a natural foundation for developing advanced compression methods that are privacy-preserving or account for the perceptual quality of the reconstructed data. The thesis considers relative entropy coding at three conceptual levels: After introducing the basics of the framework, (1) I prove results that provide new, maximally tight fundamental limits to the communication and computational efficiency of relative entropy coding; (2) I use the theory of Poisson point processes to develop and analyse new relative entropy coding algorithms, whose performance attains the theoretic optima and (3) I showcase the strong practical performance of relative entropy coding by applying it to image, audio, video and protein data compression using small, energy-efficient, probabilistic neural networks called Bayesian implicit neural representations.

Paper Structure

This paper contains 89 sections, 44 theorems, 311 equations, 21 figures, 2 tables, 8 algorithms.

Key Result

Theorem 2.1.1

Let ${\mathbf{x}} \sim P$ over some space $\mathcal{X}$, and let the optimal rate $R^*$ be defined as in \ref{eq:lossless_rate}. Then, $\mathbb{H}[{\mathbf{x}}] \leq R^*$. Furthermore, this lower bound is achievable in the sense that there exists a (not necessarily unique) code $(\mathtt{enc}, \mathtt{dec})$ such that $\mathbb{E}[|\mathtt{enc}({\mathbf{x}})|] \leq \mathbb{H}[{\mathbf{x}}] + 1$. Indeed, any code whose bitrate is within a constant of $\mathbb{H}[{\mathbf{x}}]$ is called an entropy code.
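
The achievability claim can be checked numerically on a toy example. The sketch below is an illustration, not code from the thesis: it assumes a small hypothetical discrete source (`pmf`) and uses the textbook Shannon code lengths $\lceil -\log_2 p(x) \rceil$, which may differ from the construction the thesis has in mind. It computes the entropy in bits and the expected codeword length, and checks Kraft's inequality, which guarantees that a prefix code with these lengths exists.

```python
import math

def entropy_bits(pmf):
    """Shannon entropy (in bits) of a discrete distribution {symbol: probability}."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def shannon_code_lengths(pmf):
    """Textbook Shannon code lengths: ceil(-log2 p(x)) bits for each symbol x."""
    return {x: math.ceil(-math.log2(p)) for x, p in pmf.items() if p > 0}

# Hypothetical ternary source, chosen only for illustration.
pmf = {0: 0.6, 1: 0.3, 2: 0.1}

H = entropy_bits(pmf)
lengths = shannon_code_lengths(pmf)

# Expected number of bits per symbol under these code lengths.
expected_length = sum(pmf[x] * n for x, n in lengths.items())

# Kraft sum <= 1 means a prefix-free code with these lengths exists.
kraft_sum = sum(2.0 ** -n for n in lengths.values())

print(f"entropy         = {H:.4f} bits")
print(f"expected length = {expected_length:.4f} bits")
print(f"Kraft sum       = {kraft_sum:.4f}")
```

For this source the expected length lies between the entropy and the entropy plus one bit, so the code qualifies as an entropy code in the sense of the theorem.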

Figures (21)

  • Figure 1: Example showing the two stages of arithmetic coding (AC) in the encoding of the string ${\mathbf{x}}_{1:3} = \texttt{101}$ from the ternary alphabet $\{0, 1, 2\}$, with joint distribution $p(x_{1:3})$ over the symbols. AC terminates after six steps, after finding that $[0.001110, 0.001111)$ falls between the established lower and upper bounds, and hence outputs $C(\texttt{101}) = 001110$. In the figures, the symbols $0, 1$ and $2$ are represented by the colours orange, green and blue, respectively, and the length of the coloured bars between the black separators is proportional to the probability mass of the symbols. Note that the probability masses can change for each symbol in the sequence.
  • Figure 2: Illustration of relative entropy coding for the pair of dependent random variables ${\mathbf{x}}, {\mathbf{y}} \sim P_{{\mathbf{x}}, {\mathbf{y}}}$ using a selection sampler. The sender A and the receiver B share a sequence of i.i.d. $P_{\mathbf{y}}$-distributed samples as their common randomness ${\mathbf{z}}$. Then, upon receiving a source sample ${\mathbf{x}} \sim P_{{\mathbf{x}}}$, A uses a selection rule $K$ that selects one of the samples in the shared sequence such that ${\mathbf{y}}_K \sim P_{{\mathbf{y}} \mid {\mathbf{x}}}$. Since the selected index $K$ is discrete, A uses an appropriate entropy coding algorithm to efficiently encode $K$ and transmit it to B. Finally, B can recover a $P_{{\mathbf{y}} \mid {\mathbf{x}}}$-distributed sample by decoding $K$ and selecting the $K$th sample in the shared sequence. See the main text for details on the selection rule and the encoding process.
  • Figure 4: Illustration of the width function (\ref{def:width_function}) and the canonical representation of the density ratio (\ref{lemma:width_function_properties}.3) for the case when $Q = \mathcal{N}(0.3, 0.5^2)$ and $P = \mathcal{N}(0, 1)$. Left: the density ratio/Radon-Nikodym derivative $\frac{dQ}{dP}$. The $P$-measure of the red interval is the value of the width function $w_P(h)$ at $h = 1.5$. Middle: The width function $w_P(h)$, with the value at $h = 1.5$ marked out corresponding to the left plot. It is also the probability density function of the associated random variable $H$ (\ref{lemma:width_function_properties}.1). The shaded red area is the survival function $S(h)$ evaluated at $h = 1.5$ (\ref{lemma:width_function_properties}.2). Right: The canonical representation $\eta(p)$ of the density ratio (\ref{lemma:width_function_properties}.3). The length of the red interval is the value of the width function $w_P(h)$ evaluated at $h = 1.5$, corresponding to the interval marked in the left plot.
  • Figure 5: Numerical demonstration of the looseness of the channel simulation bound in \ref{eq:csd_kl_sandwich_bound}. (A) We plot $\Delta(Q, P)$ for $Q = \mathcal{L}(0, b)$ and $P = \mathcal{L}(0, 1)$ as a function of the target log-precision $-\ln b$. (B) We plot $\Delta(Q, P)$ for $Q = \mathcal{N}(1, 1/4)^{\otimes d}$ and $P = \mathcal{N}(0, 1)^{\otimes d}$ as a function of the dimension $d$.
  • Figure 6: Rejection sampler.
  • ...and 16 more figures
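
The selection-sampler scheme described in the Figure 2 caption can be sketched with plain rejection sampling. This is a minimal illustration under stated assumptions, not the thesis's implementation: the function names `encode` and `decode` are hypothetical, the shared randomness ${\mathbf{z}}$ is modelled by a pseudo-random generator seeded identically on both sides, and the target is fixed to $Q = \mathcal{N}(0.3, 0.5^2)$ with proposal $P = \mathcal{N}(0, 1)$ (the pair used in the Figure 4 caption) instead of depending on a source sample ${\mathbf{x}}$. The entropy coding of the index $K$ is omitted.

```python
import math
import random

MU_Q, SIGMA_Q = 0.3, 0.5  # target Q = N(0.3, 0.5^2); proposal P = N(0, 1)

def density_ratio(y):
    """dQ/dP evaluated at y (the 1/sqrt(2*pi) factors cancel)."""
    log_q = -0.5 * ((y - MU_Q) / SIGMA_Q) ** 2 - math.log(SIGMA_Q)
    log_p = -0.5 * y ** 2
    return math.exp(log_q - log_p)

# Upper bound on dQ/dP for this pair: the log-ratio is
# ln 2 - 1.5 y^2 + 1.2 y - 0.18, maximised at y = 0.4 with value ln 2 + 0.06.
M = 2.0 * math.exp(0.06)

def encode(seed):
    """Sender: scan the shared sequence and return the index K of the first
    accepted sample. The sample itself is returned only so that the round
    trip can be checked; only K would be transmitted."""
    rng = random.Random(seed)          # stands in for the shared randomness
    k = 0
    while True:
        k += 1
        y = rng.normalvariate(0.0, 1.0)  # k-th shared proposal from P
        u = rng.random()                 # k-th shared uniform variate
        if u < density_ratio(y) / M:     # standard rejection-sampling test
            return k, y

def decode(seed, k):
    """Receiver: replay the same shared sequence and pick the k-th sample."""
    rng = random.Random(seed)
    y = None
    for _ in range(k):
        y = rng.normalvariate(0.0, 1.0)
        rng.random()                     # keep the generator in lockstep
    return y

k, y_sent = encode(seed=1234)
y_received = decode(seed=1234, k=k)
print(k, y_sent, y_received)
```

Because both parties replay the same pseudo-random sequence, the receiver recovers exactly the sample the sender selected while only the integer index is communicated. Plain rejection sampling is used here because it is the simplest selection rule; the thesis's other samplers (A* sampling, Greedy Poisson Rejection Sampling) follow different selection rules that this sketch does not cover.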

Theorems & Definitions (98)

  • Theorem 2.1.1: Shannon's noiseless source coding theorem.
  • Definition 2.3.1: Polish space
  • Definition 2.3.2: Channel simulation algorithm
  • Theorem 2.3.1: Lower bound on the description length of channel simulation algorithms
  • proof
  • Definition 2.3.3: Exact relative entropy coding
  • Definition 3.1.1: Width function
  • Lemma 3.1.1
  • proof
  • Definition 3.1.2: Channel Simulation Divergence
  • ...and 88 more