Conditional Distribution Compression via the Kernel Conditional Mean Embedding

Dominic Broadbent, Nick Whiteley, Robert Allison, Tom Lovett

TL;DR

We address the problem of compressing conditional distributions $\mathbb{P}_{Y|X}$ of labelled data by embedding conditional laws in a kernel mean embedding space and introducing the AMCMD metric. The approach pairs a theoretical foundation for the AMCMD with two practical linear-time algorithms, ACKH and ACKIP, that compress the conditional distribution directly. The reported costs fall from $\mathcal{O}(n^3)$ to $\mathcal{O}(n)$ (estimation), with $\mathcal{O}(m^3 + m^2 n)$ for joint optimisation. The key contributions are a consistent estimator of the AMCMD with a rate of convergence, a derivation showing that AMCMD-based compression can be performed in linear time, and experiments in which ACKIP outperforms both the joint-distribution methods (JKH and JKIP) and the greedy ACKH across continuous and discrete settings. The results indicate that preserving the conditional distribution directly gives better downstream KCME performance than compressing the joint distribution, extending kernel-based distribution compression to labelled data.

Abstract

Existing distribution compression methods, like Kernel Herding (KH), were originally developed for unlabelled data. However, no existing approach directly compresses the conditional distribution of labelled data. To address this gap, we first introduce the Average Maximum Conditional Mean Discrepancy (AMCMD), a natural metric for comparing conditional distributions. We then derive a consistent estimator for the AMCMD and establish its rate of convergence. Next, we make a key observation: in the context of distribution compression, the cost of constructing a compressed set targeting the AMCMD can be reduced from $\mathcal{O}(n^3)$ to $\mathcal{O}(n)$. Building on this, we extend the idea of KH to develop Average Conditional Kernel Herding (ACKH), a linear-time greedy algorithm that constructs a compressed set targeting the AMCMD. To better understand the advantages of directly compressing the conditional distribution rather than doing so via the joint distribution, we introduce Joint Kernel Herding (JKH), a straightforward adaptation of KH designed to compress the joint distribution of labelled data. While herding methods provide a simple and interpretable selection process, they rely on a greedy heuristic. To explore alternative optimisation strategies, we propose Joint Kernel Inducing Points (JKIP) and Average Conditional Kernel Inducing Points (ACKIP), which jointly optimise the compressed set while maintaining linear complexity. Experiments show that directly preserving conditional distributions with ACKIP outperforms both joint distribution compression (via JKH and JKIP) and the greedy selection used in ACKH. Moreover, we see that JKIP consistently outperforms JKH.
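To make the herding idea concrete, the following is a minimal sketch of standard Kernel Herding for unlabelled data. It is not the paper's ACKH or JKH. It assumes a Gaussian kernel, restricts candidates to the observed data points, and forms the full Gram matrix, so it costs quadratic rather than linear time. The function names `gaussian_kernel`, `kernel_herding` and `mmd2` are hypothetical and chosen for illustration only.

```python
import numpy as np


def gaussian_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))


def kernel_herding(x, m, lengthscale=1.0):
    """Greedily pick m row indices of x (assumption: candidates are data points)."""
    gram = gaussian_kernel(x, x, lengthscale)
    # Empirical mean embedding evaluated at each candidate: (1/n) sum_j k(x_i, x_j).
    mean_embedding = gram.mean(axis=1)
    running_sum = np.zeros(len(x))  # sum of k(x_i, x_s) over points selected so far
    selected = []
    for t in range(m):
        # Herding objective: reward closeness to the data, penalise
        # closeness to points that have already been selected.
        scores = mean_embedding - running_sum / (t + 1)
        idx = int(np.argmax(scores))
        selected.append(idx)
        running_sum += gram[:, idx]
    return np.array(selected)


def mmd2(x, idx, lengthscale=1.0):
    """Squared MMD between the full data and the selected subset."""
    gram = gaussian_kernel(x, x, lengthscale)
    return gram.mean() - 2.0 * gram[:, idx].mean() + gram[np.ix_(idx, idx)].mean()
```

Calling `kernel_herding(x, m)` on an `(n, d)` array returns `m` row indices, and `mmd2(x, idx)` measures how well the selected subset matches the full data. The greedy one-point-at-a-time selection in the loop is the heuristic that the abstract contrasts with jointly optimising the whole compressed set.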

Paper Structure

This paper contains 60 sections, 14 theorems, 150 equations, 32 figures, 2 tables, 8 algorithms.

Key Result

Theorem 2.1

(Theorem 5.2. Park2020MeasureTheoryCMMD) Suppose $l:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$ is characteristic, that $\mathbb{P}_X$ and $\mathbb{P}_{X^\prime}$ are absolutely continuous with respect to each other, and that $\mathbb{P}(\cdot \mid X)$ and $\mathbb{P}(\cdot \mid X^\prime)$ admit regu

Figures (32)

  • Figure 1: Compressed set of size $m = 25$ generated by ACKIP (green), initialised with a uniformly random subsample (yellow).
  • Figure 2: Results for the true conditional distribution compression task with parameters set as $a_0 = -0.5$, $a_1 = 0.5$, $\mu = 1$, $\sigma^2 = 1$, and $\sigma_\epsilon^2 = 0.5$. The $\text{AMCMD}^2$ (first plot) and the RMSE across three test functions are reported against the size of the compressed set. For JKH (orange), JKIP (red), ACKH (blue), and ACKIP (green), we display the median performance (bold line) with the 25th-75th percentiles (shaded region) over 20 runs. The error of random sampling (black) over 500 runs is also plotted for comparison.
  • Figure 3: Results of the true conditional distribution compression task for compressed sets of size $m = 500$. The RMSE across a variety of test functions is reported, with the IQR highlighted for each method. Outliers are points above $Q_3 + 1.5\,\text{IQR}$ or below $Q_1 - 1.5\,\text{IQR}$.
  • Figure 4: RMSE versus size of compressed set for the Superconductivity data; the RMSE is calculated against the full data estimates of $\mathbb{E}[h(Y)\mid X=\bm{x}_i]$ as the true values are not available.
  • Figure 5: RMSE achieved by compressed sets of size $m = 250$ constructed by each method for the Superconductivity data. The IQR is highlighted for each method, with outliers being points above $Q_3 + 1.5\,\text{IQR}$ or below $Q_1 - 1.5\,\text{IQR}$.
  • ...and 27 more figures
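The captions above report RMSE on estimates of $\mathbb{E}[h(Y)\mid X=\bm{x}_i]$. As a rough illustration of how such an estimate can be formed, the sketch below uses the standard regularised KCME estimator, which for a fixed test function $h$ coincides with kernel ridge regression of $h(Y)$ on $X$. The Gaussian kernel, the regularisation value, and the names `kcme_conditional_expectation` and `rmse` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def gaussian_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * lengthscale**2))


def kcme_conditional_expectation(x_train, h_y_train, x_query, reg=1e-3, lengthscale=1.0):
    """Estimate E[h(Y) | X = x] at each query point.

    x_train:   (n, d) inputs
    h_y_train: (n,) values h(y_i) of the test function on the responses
    x_query:   (q, d) points at which to evaluate the estimate
    """
    n = len(x_train)
    gram = gaussian_kernel(x_train, x_train, lengthscale)
    # Solve (K + n * reg * I) w = h(Y) rather than forming an explicit inverse.
    weights = np.linalg.solve(gram + n * reg * np.eye(n), h_y_train)
    return gaussian_kernel(x_query, x_train, lengthscale) @ weights


def rmse(estimate, reference):
    """Root mean squared error between two arrays of the same shape."""
    return float(np.sqrt(np.mean((np.asarray(estimate) - np.asarray(reference)) ** 2)))
```

To compare a compressed set against the full data in the manner of Figure 4, one would call `kcme_conditional_expectation` once with the compressed set and once with the full data as the training pair, then pass the two outputs to `rmse`. The linear solve is cubic in the number of training points, which is one reason a small compressed set is attractive here.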

Theorems & Definitions (23)

  • Theorem 2.1
  • Theorem 4.1
  • Remark 4.2
  • Lemma 4.3
  • Corollary 4.4
  • Remark 4.5
  • Remark 4.6
  • Lemma 4.7
  • Remark 4.8
  • Remark B.1
  • ...and 13 more