Wasserstein-based Kernel Principal Component Analysis for Clustering Applications
Alfredo Oneto, Blazhe Gjorgiev, Giovanni Sansavini
TL;DR
This work tackles unsupervised clustering of objects represented as distributions by combining Wasserstein distances with kernel methods. It introduces a scalable framework that (i) approximates pairwise Wasserstein distances using multiple reference distributions, (ii) builds shifted positive-definite Wasserstein-based kernels and derives feature maps via kernel PCA with Nyström approximation, and (iii) uses distance-agnostic validity indices to tune kernel parameters through Bayesian optimization. The approach is validated on real-world time series and power distribution graphs, showing competitive or superior clustering performance with favorable computational efficiency. Overall, the framework enables robust, scalable clustering of distributional data across domains.
Abstract
Many data clustering applications must handle objects that cannot be represented as vectors. In this context, the bag-of-vectors representation describes complex objects through discrete distributions, for which the Wasserstein distance provides a well-conditioned dissimilarity measure. Kernel methods extend this by embedding distance information into feature spaces that facilitate analysis. However, an unsupervised framework that combines kernels with Wasserstein distances for clustering distributional data is still lacking. We address this gap by introducing a computationally tractable framework that integrates Wasserstein metrics with kernel methods for clustering. The framework can accommodate both vectorial and distributional data, enabling applications in various domains. It comprises three components: (i) an efficient approximation of pairwise Wasserstein distances using multiple reference distributions; (ii) shifted positive definite kernel functions based on Wasserstein distances, combined with kernel principal component analysis for feature mapping; and (iii) scalable, distance-agnostic validity indices for clustering evaluation and kernel parameter optimization. Experiments on power distribution graphs and real-world time series demonstrate the effectiveness and efficiency of the proposed framework.
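As an illustration of component (ii) only, the following is a minimal sketch under stated assumptions, not the authors' implementation. It uses exact Wasserstein-1 distances between equal-size one-dimensional samples (so the reference-distribution approximation and the Nyström step are omitted), an exponential kernel exp(-gamma * W), and a diagonal eigenvalue shift as one possible way to enforce positive semi-definiteness. The function names, the kernel form, the value of gamma, and the toy data are all hypothetical choices made for this example.

```python
import numpy as np

def wasserstein_1d(x, y):
    """Exact W1 between two equal-size, uniformly weighted 1-D samples.

    For such samples the optimal coupling matches sorted values.
    """
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def shifted_kernel(D, gamma):
    """Exponential kernel on a distance matrix, shifted to be PSD.

    Assumption: the shift adds |lambda_min| to the diagonal when the
    smallest eigenvalue is negative; the paper's shift may differ.
    """
    K = np.exp(-gamma * D)
    lam_min = np.linalg.eigvalsh(K).min()
    if lam_min < 0:
        K = K + (-lam_min) * np.eye(len(K))
    return K

def kernel_pca(K, n_components):
    """Feature map from the top eigenpairs of the double-centered kernel."""
    n = len(K)
    H = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)          # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    return vecs * np.sqrt(np.clip(vals, 0.0, None))

# Toy data (hypothetical): two groups of 1-D "bags" with different means.
rng = np.random.default_rng(0)
bags = [rng.normal(0.0, 1.0, 50) for _ in range(10)]
bags += [rng.normal(5.0, 1.0, 50) for _ in range(10)]

# Pairwise distance matrix between bags.
n = len(bags)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = wasserstein_1d(bags[i], bags[j])

K = shifted_kernel(D, gamma=0.5)             # gamma is an arbitrary choice
Z = kernel_pca(K, n_components=2)            # one feature row per bag
```

The rows of `Z` are ordinary vectors, so any standard clustering algorithm can be applied to them. In the framework described above, the kernel parameter would be selected by optimizing a validity index rather than fixed by hand as it is here.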
