Learning to Embed Distributions via Maximum Kernel Entropy
Oleksii Kachaiev, Stefano Recanatesi
TL;DR
The paper addresses learning a data-dependent kernel for distribution regression by maximizing the entropy of the dataset's covariance embedding in a reproducing kernel Hilbert space (RKHS). It introduces Maximum Distribution Kernel Entropy (MDKE), an unsupervised objective that trains a latent encoder so that the kernel covariance of the encoded distributions has high entropy, which increases inter-distribution separation while reducing intra-distribution variance. The authors establish theoretical links between entropy, distributional variance, and the geometry of mean embeddings, and report improvements on flow cytometry, image histogram, and text distribution tasks, including strong MNIST and 20 Newsgroups results after unsupervised pre-training. The approach offers a theoretically grounded alternative to hand-crafted kernels and a general framework for learning distributional representations across several modalities. It also opens avenues for combining data-dependent kernels with downstream kernel methods in settings where inputs are distributions rather than fixed vectors.
Abstract
Empirical data can often be considered as samples from a set of probability distributions. Kernel methods have emerged as a natural approach for learning to classify these distributions. Although numerous kernels between distributions have been proposed, applying kernel methods to distribution regression tasks remains challenging, primarily because selecting a suitable kernel is not straightforward. Surprisingly, the question of learning a data-dependent distribution kernel has received little attention. In this paper, we propose a novel objective for the unsupervised learning of a data-dependent distribution kernel, based on the principle of entropy maximization in the space of probability measure embeddings. We examine the theoretical properties of the latent embedding space induced by our objective, demonstrating that its geometric structure is well-suited for solving downstream discriminative tasks. Finally, we demonstrate the performance of the learned kernel across different modalities.
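To make the idea of entropy over distribution embeddings more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Gaussian kernel on points, a Gaussian kernel on empirical mean embeddings, and a second-order Rényi entropy computed from the trace-normalised Gram matrix; none of these choices is stated in the text above, and all function names (gaussian_kernel, mean_embedding_inner, distribution_gram, renyi2_entropy) and the gamma parameters are hypothetical. The encoder that would be trained against this quantity is omitted.

```python
import math


def gaussian_kernel(x, y, gamma=1.0):
    """Point-level kernel k(x, y) = exp(-gamma * ||x - y||^2) (assumed choice)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mean_embedding_inner(P, Q, gamma=1.0):
    """Empirical inner product of mean embeddings: average of k(x, y) over all pairs."""
    total = sum(gaussian_kernel(x, y, gamma) for x in P for y in Q)
    return total / (len(P) * len(Q))


def distribution_gram(dists, gamma=1.0, gamma_dist=1.0):
    """Gram matrix of a Gaussian kernel between distributions (assumed choice).

    K[i][j] = exp(-gamma_dist * d2), where d2 is the squared RKHS distance
    between the empirical mean embeddings of dists[i] and dists[j].
    """
    m = len(dists)
    inner = [[mean_embedding_inner(dists[i], dists[j], gamma) for j in range(m)]
             for i in range(m)]
    K = []
    for i in range(m):
        row = []
        for j in range(m):
            # Clamp at zero to guard against tiny negative values from rounding.
            d2 = max(inner[i][i] + inner[j][j] - 2.0 * inner[i][j], 0.0)
            row.append(math.exp(-gamma_dist * d2))
        K.append(row)
    return K


def renyi2_entropy(K):
    """Second-order Renyi entropy of K / m, i.e. -log of its squared Frobenius norm.

    K has a unit diagonal, so K / m has unit trace.
    """
    m = len(K)
    frob_sq = sum((K[i][j] / m) ** 2 for i in range(m) for j in range(m))
    return -math.log(frob_sq)


# Two toy datasets of one-dimensional samples (illustrative values only).
P = [[0.0], [0.1]]
Q = [[10.0], [10.1]]
collapsed = renyi2_entropy(distribution_gram([P, P]))  # identical distributions
separated = renyi2_entropy(distribution_gram([P, Q]))  # well-separated distributions
print(collapsed, separated)
```

Under these assumptions the entropy lies between 0, when all distributions map to the same embedding, and log m for m distributions whose embeddings are mutually far apart. Maximizing it therefore rewards an encoder that spreads distinct distributions apart in the embedding space.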
