Bilateral Distribution Compression: Reducing Both Data Size and Dimensionality

Dominic Broadbent, Nick Whiteley, Robert Allison, Tom Lovett

TL;DR

BDC introduces a principled framework to compress both the sample size and feature dimension while preserving the original distribution, by coupling a latent autoencoder trained with Reconstruction MMD (RMMD) to a latent-space compression step guided by Encoded MMD (EMMD). The Decoded MMD (DMMD) serves as the central distributional metric, with theoretical guarantees showing that vanishing RMMD and EMMD imply vanishing DMMD, and, under a pull-back kernel, DMMD is bounded by the sum RMMD+EMMD. The authors prove connections to PCA under certain kernels and extend the approach to labelled data via tensor-product RKHS, providing a flexible, task-agnostic compression scheme. Empirically, BDC achieves comparable or superior downstream performance to ambient-space compression at substantially lower cost and with much higher compression rates, across regression, classification, and clustering tasks, including exact Gaussian-case demonstrations and manifold-aware uncertainty quantification using Gaussian Processes.

Abstract

Existing distribution compression methods reduce the number of observations in a dataset by minimising the Maximum Mean Discrepancy (MMD) between original and compressed sets, but modern datasets are often large in both sample size and dimensionality. We propose Bilateral Distribution Compression (BDC), a two-stage framework that compresses along both axes while preserving the underlying distribution, with overall linear time and memory complexity in dataset size and dimension. Central to BDC is the Decoded MMD (DMMD), which we introduce to quantify the discrepancy between the original data and a compressed set decoded from a low-dimensional latent space. BDC proceeds by (i) learning a low-dimensional projection using the Reconstruction MMD (RMMD), and (ii) optimising a latent compressed set with the Encoded MMD (EMMD). We show that this procedure minimises the DMMD, guaranteeing that the compressed set faithfully represents the original distribution. Experiments show that BDC can achieve comparable or superior downstream task performance to ambient-space compression at substantially lower cost and with significantly higher rates of compression.
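The three discrepancies named in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a Gaussian kernel, a biased (V-statistic) MMD estimator that costs quadratic time, a linear encoder/decoder built from principal directions as a stand-in for the learned projection, and a uniform random subsample as a stand-in for the optimised latent compressed set. The names `gaussian_kernel`, `mmd_squared`, `encode`, and `decode` are hypothetical, as are the definitions of RMMD, EMMD, and DMMD used here, which are read off the abstract's wording.

```python
import numpy as np


def gaussian_kernel(a, b, lengthscale=1.0):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 * lengthscale^2))."""
    sq_dists = (
        np.sum(a**2, axis=1)[:, None]
        + np.sum(b**2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))


def mmd_squared(x, y, lengthscale=1.0):
    """Biased (V-statistic) estimate of the squared MMD between samples x and y."""
    k_xx = gaussian_kernel(x, x, lengthscale).mean()
    k_yy = gaussian_kernel(y, y, lengthscale).mean()
    k_xy = gaussian_kernel(x, y, lengthscale).mean()
    return k_xx + k_yy - 2.0 * k_xy


rng = np.random.default_rng(0)
n, d, p, m = 500, 10, 2, 20  # sample size, ambient dim, latent dim, compressed size

# Toy data: zero-mean Gaussian with most variance in the first few coordinates.
x = rng.normal(size=(n, d)) * np.linspace(3.0, 0.1, d)
x -= x.mean(axis=0)

# Assumption: a linear projection onto the top-p principal directions stands in
# for the learned encoder/decoder pair.
_, _, vt = np.linalg.svd(x, full_matrices=False)
v = vt[:p].T  # shape (d, p), orthonormal columns


def encode(a):
    return a @ v


def decode(z):
    return z @ v.T


# Assumption: a uniform random subsample of the encoded data stands in for the
# optimised latent compressed set.
z = encode(x[rng.choice(n, size=m, replace=False)])

ls = 3.0
rmmd2 = mmd_squared(x, decode(encode(x)), ls)  # original vs. reconstruction
emmd2 = mmd_squared(encode(x), z, ls)          # encoded data vs. latent compressed set
dmmd2 = mmd_squared(x, decode(z), ls)          # original vs. decoded compressed set

print(f"RMMD^2={rmmd2:.4g}  EMMD^2={emmd2:.4g}  DMMD^2={dmmd2:.4g}")
```

In this toy setup the decoder has orthonormal columns, so the Gaussian kernel on the latent space coincides with the ambient kernel pulled back through the decoder. The triangle inequality for MMD then gives DMMD ≤ RMMD + EMMD for these estimates, which mirrors the pull-back bound mentioned in the TL;DR. A nonlinear decoder or a different latent kernel would not satisfy this automatically.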

Paper Structure

This paper contains 65 sections, 11 theorems, 91 equations, 26 figures, 2 tables, and 2 algorithms.

Key Result

Theorem 3.1

Assume the distribution $\mathbb{P}_X$ has zero mean, and let the kernel $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be the quadratic kernel defined by $k(\bm{x}, \bm{y}) = (1 + \bm{x}^\top \bm{y})^2$. Then $V^{\text{RMMD}}_*$ is given by (a permutation of) the top $p$ eigenvectors of the

Figures (26)

  • Figure 1: The Bilateral Distribution Compression framework.
  • Figure 2: Latent representations of a dataset constructed by sampling two well-separated clusters from a $50$-dimensional Gaussian with identity covariance. Training with MSRE preserves cluster separation but misaligns reconstructed distributions (higher test RMMD), RMMD aligns distributions but loses cluster separation, while RMMD $+$ MSRE balances both.
  • Figure 3: Left: Swiss-Roll manifold, coloured by the value of $u$. Right: test mean squared error on Swiss-Roll. BDC-L (red), BDC-NL (green), and ADC (blue) are each reported over $25$ runs, URS (grey) over $1000$ runs, and FULL is shown as a black dashed line.
  • Figure 4: Left: test negative log-likelihood on Swiss-Roll. Right: test continuous ranked probability score on Swiss-Roll. BDC-L (red), BDC-NL (green), and ADC (blue) are each reported over $25$ runs, URS (grey) over $1000$ runs, and FULL is shown as a black dashed line.
  • Figure 5: Test mean squared error on CT-Slice. BDC-L (red), BDC-NL (green), and ADC (blue) are each reported over $10$ runs, and URS (grey) over $200$ runs.
  • ...and 21 more figures

Theorems & Definitions (16)

  • Theorem 3.1
  • Remark 3.2
  • Theorem 3.3
  • Remark 3.4
  • Theorem 3.5
  • Remark 3.6
  • Theorem 1.1
  • Theorem 2.1
  • Theorem 2.2
  • Theorem 2.3
  • ...and 6 more