Learning to Embed Distributions via Maximum Kernel Entropy

Oleksii Kachaiev, Stefano Recanatesi

TL;DR

The paper addresses the problem of learning a data-dependent kernel for distribution regression by maximizing the entropy of the dataset's covariance operator embedding in an RKHS. It introduces Maximum Distribution Kernel Entropy (MDKE), an unsupervised objective that trains a latent encoder to produce distributions whose kernel covariance operator has high entropy, which increases the separation between distributions while reducing the variance within each distribution. The authors establish theoretical links between entropy, distributional variance, and the geometry of mean embeddings. They evaluate the learned kernel on flow cytometry data, image histograms, and text distributions, and report strong results on MNIST and 20 Newsgroups after unsupervised pre-training. The approach offers a theoretically grounded alternative to hand-crafted kernels and a framework for learning distributional representations across several modalities. It also opens avenues for combining data-dependent kernels with downstream kernel methods in settings where the inputs are distributions rather than fixed vectors.

Abstract

Empirical data can often be considered as samples from a set of probability distributions. Kernel methods have emerged as a natural approach for learning to classify these distributions. Although numerous kernels between distributions have been proposed, applying kernel methods to distribution regression tasks remains challenging, primarily because selecting a suitable kernel is not straightforward. Surprisingly, the question of learning a data-dependent distribution kernel has received little attention. In this paper, we propose a novel objective for the unsupervised learning of a data-dependent distribution kernel, based on the principle of entropy maximization in the space of probability measure embeddings. We examine the theoretical properties of the latent embedding space induced by our objective, demonstrating that its geometric structure is well-suited for solving downstream discriminative tasks. Finally, we demonstrate the performance of the learned kernel across different modalities.

Paper Structure

This paper contains 39 sections, 6 theorems, 35 equations, 5 figures, 1 table.

Key Result

Proposition 3.3

For a set of $M$ probability distributions $\mathcal{D}_M$, the second-order Rényi entropy $\mathcal{S}_2$ of the empirical covariance operator embedding $\hat{\Sigma}_\mathcal{D}$ induced by the choice of Gaussian distribution kernel $K_\text{RBF}$ over points in the RKHS $\mathcal{H}_\text{emb}$, [...] where $\gamma$ is the bandwidth of the distribution kernel $K_\text{RBF}$.
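To make the quantity in this proposition concrete, here is a minimal sketch of how the second-order Rényi entropy of an empirical covariance operator can be computed from a Gram matrix. It relies on the standard identity $\mathcal{S}_2(\hat{\Sigma}) = -\log \operatorname{tr}(\hat{\Sigma}^2) = -\log \lVert K / M \rVert_F^2$, which holds when the kernel satisfies $K(P, P) = 1$. Everything else is an assumption made for illustration and may differ from the paper's conventions: the Gaussian form of the embedding kernel, the scaling $\exp(-\tfrac{\gamma}{2}\,\mathrm{MMD}^2)$ of the distribution kernel, the plug-in (biased) MMD estimate, and all function names, which are hypothetical.

```python
import math
import random


def embedding_kernel(x, y, bandwidth=1.0):
    """Gaussian kernel between two latent points (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-0.5 * bandwidth * sq_dist)


def mean_inner_product(xs, ys, bandwidth=1.0):
    """Plug-in estimate of <mu_P, mu_Q>: average kernel value over all pairs."""
    total = sum(embedding_kernel(x, y, bandwidth) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def distribution_gram(samples, gamma=1.0, bandwidth=1.0):
    """Gram matrix of the assumed kernel K(P, Q) = exp(-gamma / 2 * MMD^2(P, Q))."""
    m = len(samples)
    inner = [[mean_inner_product(samples[i], samples[j], bandwidth)
              for j in range(m)] for i in range(m)]
    gram = []
    for i in range(m):
        row = []
        for j in range(m):
            # Squared RKHS distance between the two kernel mean embeddings.
            mmd_sq = max(inner[i][i] + inner[j][j] - 2.0 * inner[i][j], 0.0)
            row.append(math.exp(-0.5 * gamma * mmd_sq))
        gram.append(row)
    return gram


def renyi2_entropy(gram):
    """S_2 = -log tr(Sigma^2) = -log ||K / M||_F^2 for an M x M Gram matrix K."""
    m = len(gram)
    frob_sq = sum(v * v for row in gram for v in row)
    return -math.log(frob_sq / (m * m))


def sample_near(center, noise, n, rng):
    """Draw n noisy copies of `center`, each projected back onto the unit sphere."""
    points = []
    for _ in range(n):
        p = [c + rng.gauss(0.0, noise) for c in center]
        norm = math.sqrt(sum(v * v for v in p))
        points.append([v / norm for v in p])
    return points


rng = random.Random(0)
axes = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
spread = [sample_near(c, 0.05, 20, rng) for c in axes]  # six separated distributions
collapsed = [spread[0]] * len(axes)                      # six identical distributions

print("spread   :", renyi2_entropy(distribution_gram(spread, gamma=10.0)))
print("collapsed:", renyi2_entropy(distribution_gram(collapsed, gamma=10.0)))
```

Under these assumptions the entropy lies between 0 (all distributions coincide, so the Gram matrix is all ones) and $\log M$ (the Gram matrix is the identity). This is why well-separated, concentrated distributions score higher than collapsed ones in the sketch.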

Figures (5)

  • Figure 1: Learning to embed distributions. (a) Example of multiple distributions over the input space. (b) The trainable function $f_\theta$ encodes the input dataset into a compact latent space, in our case $\mathcal{Z} = \mathcal{S}^{d-1}$. (c) The first-level embedding kernel $k$ induces a kernel mean embedding map to $\mathcal{H}$. The encoder is optimized to maximize the entropy of the covariance operator embedding of the dataset w.r.t. the second-level distribution kernel $K$ between kernel mean embeddings in $\mathcal{H}$. (d) Using the learned data-dependent kernel, downstream classification tasks can be solved with tools such as Kernel SVM or Kernel Ridge Regression.
  • Figure 2: Properties of the entropy on the toy example. (a) Entropy and Distributional Variance for 6 distributions on a sphere as a function of their geometrical arrangement parametrized by $\gamma$. (b) Kernel norms that enter the distributional variance bound. The blue shaded area (difference between blue and red lines) corresponds to the dotted red line in (a) (up to a multiplicative factor). (c) Flattening of Gram matrix eigenvalues as a function of $\gamma$.
  • Figure 3: The effect of regularization on the training dynamics. The distribution of the eigenvalues of the distribution kernel Gram matrix, calculated for 2,000 sentences sampled from the '20 Newsgroups' dataset (details in Section \ref{sec:exp-text}), is observed throughout the training. (a) Training with no regularization leads to the collapse of smaller eigenvalues. (b) The regularization stabilizes the training by preventing eigenvalues from collapsing.
  • Figure 4: Unsupervised encoding of Images. Unsupervised learning of image embeddings as finite-support distributions (i.e., histograms) of pixel intensities. For every pixel position we assign a point location on the unit hypersphere and optimize such locations via the covariance operator dataset embedding w.r.t. the MDKE objective. (a) Samples from the MNIST dataset and learned pixel-to-pixel interaction kernel Gram matrix. (b) Spectral clustering of pixels based on the learned kernel Gram matrix. (c) and (d) Same as (a) and (b) for the Fashion-MNIST dataset.
  • Figure 5: Unsupervised encoding of Text. Unsupervised learning of sentence embeddings as empirical distributions of words on the '20 Newsgroups' dataset. The quality of the learned embeddings is evaluated by performing sentence-to-topic classification. (a) Distribution kernel entropy, distributional variance, and validation accuracy throughout training. (b) Kernel norms (Eq. \ref{eq:distr_norm_gap}) throughout training. The shaded blue area (the difference between the blue and red lines) corresponds to the blue dotted line in panel (a) (up to a multiplicative factor).

Theorems & Definitions (13)

  • Definition 3.2
  • Proposition 3.3
  • Proposition 3.4
  • Proposition 3.5
  • Lemma A.1
  • proof
  • Definition A.2
  • Lemma A.3
  • proof
  • Remark A.4
  • ...and 3 more