Wasserstein-based Kernel Principal Component Analysis for Clustering Applications

Alfredo Oneto, Blazhe Gjorgiev, Giovanni Sansavini

TL;DR

This work tackles unsupervised clustering of objects represented as distributions by uniting Wasserstein distances with kernel methods. It introduces a scalable framework that (i) approximates pairwise Wasserstein distances using multiple reference distributions, (ii) builds shifted positive-definite Wasserstein-based kernels and derives feature maps via kernel PCA with Nyström approximation, and (iii) uses distance-agnostic validity indices to optimize kernel parameters through Bayesian optimization. The approach is validated on real-world time series and power-distribution graphs, showing competitive or superior clustering performance with favorable computational efficiency. Overall, the framework enables robust, scalable clustering of distributional data across domains by integrating transport-based similarity with kernel-derived representations and principled parameter tuning.

Abstract

Many data clustering applications must handle objects that cannot be represented as vectors. In this context, the bag-of-vectors representation describes complex objects through discrete distributions, for which the Wasserstein distance provides a well-conditioned dissimilarity measure. Kernel methods extend this by embedding distance information into feature spaces that facilitate analysis. However, an unsupervised framework that combines kernels with Wasserstein distances for clustering distributional data is still lacking. We address this gap by introducing a computationally tractable framework that integrates Wasserstein metrics with kernel methods for clustering. The framework can accommodate both vectorial and distributional data, enabling applications in various domains. It comprises three components: (i) an efficient approximation of pairwise Wasserstein distances using multiple reference distributions; (ii) shifted positive definite kernel functions based on Wasserstein distances, combined with kernel principal component analysis for feature mapping; and (iii) scalable, distance-agnostic validity indices for clustering evaluation and kernel parameter optimization. Experiments on power distribution graphs and real-world time series demonstrate the effectiveness and efficiency of the proposed framework.
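As a rough illustration of component (ii), the sketch below builds a kernel from pairwise Wasserstein distances, applies a diagonal shift to make it positive semidefinite, and extracts kernel PCA features. Several choices here are assumptions that the abstract does not specify: the Gaussian-like kernel form `exp(-gamma * W**2)`, the value of `gamma`, the use of the smallest eigenvalue as the shift, and the toy 1-D data. Distances are computed exactly with the sorted-sample formula for equal-size 1-D samples, and the Nyström approximation is omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "bags of vectors": 1-D samples from two groups of distributions (illustrative data).
bags = [rng.normal(loc=0.0, scale=1.0, size=50) for _ in range(5)]
bags += [rng.normal(loc=4.0, scale=1.0, size=50) for _ in range(5)]
S = len(bags)

# Pairwise 1-Wasserstein distances. For equal-size, uniformly weighted 1-D samples,
# W1 is the mean absolute difference between the sorted samples.
sorted_bags = np.sort(np.stack(bags), axis=1)
W = np.abs(sorted_bags[:, None, :] - sorted_bags[None, :, :]).mean(axis=2)

# Assumed kernel form (not taken from the paper): Gaussian-like kernel on W.
gamma = 0.5  # hypothetical kernel parameter
K = np.exp(-gamma * W**2)

# Diagonal shift: add just enough to the diagonal to remove negative eigenvalues.
min_eig = np.linalg.eigvalsh(K).min()
shift = max(0.0, -min_eig)
K_shifted = K + shift * np.eye(S)

# Kernel PCA: double-center the kernel, eigendecompose, scale the eigenvectors.
J = np.eye(S) - np.ones((S, S)) / S
Kc = J @ K_shifted @ J
eigvals, eigvecs = np.linalg.eigh(Kc)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

n_components = 2
features = eigvecs[:, :n_components] * np.sqrt(
    np.clip(eigvals[:n_components], 0.0, None)
)
```

The rows of `features` can then be passed to any vector-based clustering routine; in this toy setup the first component alone separates the two groups.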

Paper Structure

This paper contains 29 sections, 1 theorem, 28 equations, 8 figures, 8 tables, 1 algorithm.

Key Result

Theorem 3.1

Given an arbitrary $S \times S$ kernel matrix, the optimal clustering assignments of the kernel k-medoids problem, Eqs. (11a)-(11d), are invariant to a diagonal shift of the kernel matrix.
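A small numerical check can make the statement concrete. The sketch below is not the paper's formulation: it assumes the kernel k-medoids cost is the sum of kernel-induced squared distances $K_{ii} + K_{mm} - 2K_{im}$ from each non-medoid point $i$ to its assigned medoid $m$, with each medoid assigned to itself. Under that assumption, adding $c$ to the diagonal raises every off-diagonal squared distance by $2c$, so every feasible solution's cost grows by the same constant $2c(S-k)$ and the minimizer does not change. The function name `kmedoids_bruteforce` and the random test matrix are illustrative.

```python
import random
from itertools import combinations


def kmedoids_bruteforce(K, k):
    """Exhaustively solve kernel k-medoids on a small kernel matrix (list of lists)."""
    S = len(K)
    best = (float("inf"), None, None)
    for medoids in combinations(range(S), k):
        labels, cost = [], 0.0
        for i in range(S):
            if i in medoids:
                labels.append(i)  # assumption: a medoid is assigned to itself
                continue
            # Squared feature-space distance induced by the kernel.
            d2 = {m: K[i][i] + K[m][m] - 2.0 * K[i][m] for m in medoids}
            m_best = min(d2, key=d2.get)
            labels.append(m_best)
            cost += d2[m_best]
        if cost < best[0]:
            best = (cost, medoids, labels)
    return best


rng = random.Random(0)
S, k, c = 7, 2, 3.5

# Arbitrary symmetric (generally indefinite) matrix standing in for a kernel matrix.
A = [[rng.uniform(-1.0, 1.0) for _ in range(S)] for _ in range(S)]
K = [[0.5 * (A[i][j] + A[j][i]) for j in range(S)] for i in range(S)]

# Same matrix with a diagonal shift of c.
K_shift = [[K[i][j] + (c if i == j else 0.0) for j in range(S)] for i in range(S)]

cost, medoids, labels = kmedoids_bruteforce(K, k)
cost_s, medoids_s, labels_s = kmedoids_bruteforce(K_shift, k)

print("same medoids:", medoids == medoids_s)
print("same labels:", labels == labels_s)
print("cost difference:", cost_s - cost, "expected:", 2 * c * (S - k))
```

Exhaustive search is only feasible for tiny $S$; it is used here so that the comparison involves true optima rather than the output of a heuristic.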

Figures (8)

  • Figure 1: Illustration of the linear optimal transport distance variables. The reference $\boldsymbol{\sigma}$ is used to approximate the Wasserstein distance between the distributions $\boldsymbol{\mu}_i$ and $\boldsymbol{\mu}_j$.
  • Figure 2: The clustering results for the Italy time series dataset. The top row shows the time domain, where we represent the median value for each cluster with dashed lines and display the medoids with solid lines. One and two standard deviations from the medoids are depicted with opaque bands. The time series are unitless and presented in their original, zero-centered format. The bottom row shows the frequency domain, including the medoids, the median values, and bands around the medoids of the NPSD at each frequency, expressed in CPD.
  • Figure 3: The cumulative density of approximation errors for the Wasserstein distance is shown for MV (left) and LV (right) graphs. Dashed lines indicate the 70th and 90th percentiles of the errors within each cumulative distribution.
  • Figure 4: Validity indices plotted against the explained variance of the first five components. The validity indices corresponding to the selected clustering results are indicated with black diamonds.
  • Figure 5: Projection via t-SNE of the MV graphs feature maps. The cluster medoids are indicated with a black triangle. The clusters are named after the identifier of the medoid in the dataset.
  • ...and 3 more figures
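
The role of the reference in Figure 1 can be sketched with a single reference distribution. The example assumes equal-size point clouds with uniform weights, for which an optimal transport plan is a one-to-one matching that `scipy.optimize.linear_sum_assignment` can compute. The point clouds, the helper name `ot_matching`, and the use of one reference (the paper combines multiple references, in a way not described in this summary) are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def ot_matching(x, y):
    """Optimal matching between two equal-size, uniformly weighted point clouds.

    Returns the index of the point in y matched to each point in x,
    and the squared 2-Wasserstein distance between the two clouds.
    """
    cost = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)
    rows, cols = linear_sum_assignment(cost)
    return cols, cost[rows, cols].mean()


rng = np.random.default_rng(1)
n = 40
sigma = rng.normal(size=(n, 2))                            # reference distribution
mu_i = rng.normal(loc=[2.0, 0.0], size=(n, 2))             # first distribution
mu_j = rng.normal(loc=[0.0, 2.0], scale=1.5, size=(n, 2))  # second distribution

# Transport each reference point to its matched point in mu_i and in mu_j.
perm_i, _ = ot_matching(sigma, mu_i)
perm_j, _ = ot_matching(sigma, mu_j)
T_i = mu_i[perm_i]
T_j = mu_j[perm_j]

# Linear optimal transport distance: compare the two images of the reference.
lot_w2 = np.sqrt(((T_i - T_j) ** 2).sum(axis=1).mean())

# Exact 2-Wasserstein distance for comparison.
_, w2_sq = ot_matching(mu_i, mu_j)
exact_w2 = np.sqrt(w2_sq)

print(f"LOT approximation: {lot_w2:.3f}, exact W2: {exact_w2:.3f}")
```

Because pairing `T_i` with `T_j` through the reference is itself a valid coupling of `mu_i` and `mu_j`, the approximation in this setting never falls below the exact distance. The practical benefit is that each distribution is matched to the reference once, after which all pairwise distances reduce to Euclidean comparisons.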

Theorems & Definitions (2)

  • Theorem 3.1
  • proof