Canonical Variates in Wasserstein Metric Space

Jia Li, Lin Lin

TL;DR

This work extends Fisher's linear discriminant ideas to distributions represented as data clouds by leveraging the Wasserstein metric. It introduces a dimension-reduction framework (CVW) that maximizes a Fisher-like ratio of between-class to within-class pairwise Wasserstein distances via an alternating optimal-transport and projection algorithm (OTAF). The method uses both discrete distributions and Gaussian mixtures (MAW) to compute the objective and derives a Rayleigh-Ritz surrogate to enable efficient optimization. Empirical results on pulmonary fibrosis, breast cancer, and uveal melanoma datasets show substantial accuracy and AUC gains over vector-based classifiers and robustness to changes in GMM representations and clustering schemes. The approach is parallelizable and offers flexibility in how distributional data are represented and processed, highlighting its practical impact for distributional data classification in biomedical contexts.

Abstract

In this paper, we address the classification of instances each characterized not by a singular point, but by a distribution on a vector space. We employ the Wasserstein metric to measure distances between distributions, which are then used by distance-based classification algorithms such as k-nearest neighbors, k-means, and pseudo-mixture modeling. Central to our investigation is dimension reduction within the Wasserstein metric space to enhance classification accuracy. We introduce a novel approach grounded in the principle of maximizing Fisher's ratio, defined as the quotient of between-class variation to within-class variation. The directions in which this ratio is maximized are termed discriminant coordinates or canonical variates axes. In practice, we define both between-class and within-class variations as the average squared distances between pairs of instances, with the pairs either belonging to the same class or to different classes. This ratio optimization is achieved through an iterative algorithm, which alternates between optimal transport and maximization steps within the vector space. We conduct empirical studies to assess the algorithm's convergence and, through experimental validation, demonstrate that our dimension reduction technique substantially enhances classification performance. Moreover, our method outperforms well-established algorithms that operate on vector representations derived from distributional data. It also exhibits robustness against variations in the distributional representations of data clouds.
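
The ratio described in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes a single projection direction, equal-size clouds with uniform weights (so the 1-D squared Wasserstein distance reduces to matching sorted values), and made-up toy data. The names `project`, `w2_squared_1d`, and `fisher_ratio` are hypothetical.

```python
from itertools import combinations

def project(cloud, a):
    """Project each point of a cloud onto the direction a (one canonical variate)."""
    return [sum(x_k * a_k for x_k, a_k in zip(x, a)) for x in cloud]

def w2_squared_1d(u, v):
    """Squared 2-Wasserstein distance between two equal-size, uniformly
    weighted 1-D samples; the optimal coupling pairs the sorted values."""
    if len(u) != len(v):
        raise ValueError("this sketch assumes equal-size clouds")
    return sum((s - t) ** 2 for s, t in zip(sorted(u), sorted(v))) / len(u)

def fisher_ratio(clouds, labels, a):
    """Average between-class over average within-class squared distance,
    both taken over pairs of projected clouds."""
    proj = [project(c, a) for c in clouds]
    between, within = [], []
    for i, j in combinations(range(len(clouds)), 2):
        d = w2_squared_1d(proj[i], proj[j])
        (within if labels[i] == labels[j] else between).append(d)
    return (sum(between) / len(between)) / (sum(within) / len(within))

# Toy data (assumed, not from the paper): four 2-D clouds in two classes
# that differ mainly along the first coordinate.
clouds = [
    [(0.0, 0.0), (1.0, 3.0)],
    [(0.5, 1.0), (1.5, 4.0)],
    [(4.0, 1.0), (5.0, 4.0)],
    [(4.5, 0.0), (5.5, 3.0)],
]
labels = [0, 0, 1, 1]

ratio_x = fisher_ratio(clouds, labels, (1.0, 0.0))  # project on first axis
ratio_y = fisher_ratio(clouds, labels, (0.0, 1.0))  # project on second axis
```

In this toy setup the first axis separates the classes far better than the second, so `ratio_x` is much larger than `ratio_y`. A dimension-reduction step in the spirit of the abstract would search for the direction that maximizes this ratio.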

Paper Structure

This paper contains 17 sections, 4 theorems, 42 equations, 6 figures, 1 algorithm.

Key Result

Lemma 1

The between-class variation is $\bar{V}_B(A,\Pi^{*})=\operatorname{tr}(A^t C_B A)$, and the within-class variation is $\bar{V}_W(A,\Pi^{*})=\operatorname{tr}(A^t C_W A)$.
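
Once both variations are written as traces, one standard way to choose $A$ for fixed $C_B$ and $C_W$ is through the generalized eigenproblem $C_B v = \lambda C_W v$. The sketch below is a minimal illustration of that idea, not necessarily the surrogate derived in the paper. It assumes $C_W$ is symmetric positive definite, uses made-up toy matrices, and the names `canonical_variates` and `trace_ratio` are hypothetical. For a single direction the top generalized eigenvector maximizes the ratio exactly; for several directions it is only a surrogate for the trace ratio.

```python
import numpy as np

def canonical_variates(C_B, C_W, d):
    """Top-d generalized eigenvectors of (C_B, C_W) via Cholesky whitening.

    Assumes C_B is symmetric and C_W is symmetric positive definite.
    Returns A (columns scaled so that A^t C_W A = I) and the eigenvalues.
    """
    L = np.linalg.cholesky(C_W)                      # C_W = L L^t
    # Whitened between-class matrix M = L^{-1} C_B L^{-t}
    M = np.linalg.solve(L, np.linalg.solve(L, C_B).T).T
    M = (M + M.T) / 2.0                              # remove round-off asymmetry
    evals, V = np.linalg.eigh(M)                     # ascending eigenvalues
    order = np.argsort(evals)[::-1][:d]              # indices of the d largest
    A = np.linalg.solve(L.T, V[:, order])            # map back: A = L^{-t} V
    return A, evals[order]

def trace_ratio(A, C_B, C_W):
    """Ratio tr(A^t C_B A) / tr(A^t C_W A) from Lemma 1."""
    return np.trace(A.T @ C_B @ A) / np.trace(A.T @ C_W @ A)

# Toy matrices (assumed for illustration only).
C_W = np.array([[2.0, 0.5, 0.0],
                [0.5, 1.0, 0.0],
                [0.0, 0.0, 1.0]])
C_B = np.array([[3.0, 1.0, 0.0],
                [1.0, 2.0, 0.0],
                [0.0, 0.0, 0.1]])

A, evals = canonical_variates(C_B, C_W, d=2)
best_single = trace_ratio(A[:, :1], C_B, C_W)        # equals evals[0]
```

In an alternating scheme such as the one summarized above, a step of this kind would be repeated after each update of the transport plans $\Pi^{*}$, since $C_B$ and $C_W$ depend on them.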

Figures (6)

  • Figure 1: Two approaches for representing data clouds. Approach 1: data points in each instance (a data cloud) are clustered separately. Approach 2: data points in all instances are pooled and clustered together with coherent cluster labels assigned. Cluster-wise feature extraction can be performed to create vector representations.
  • Figure 2: Performance comparison based on AUC and classification accuracy. (a) Results obtained from GMMs generated by the combined clustering and separate clustering schemes. The number of components in the GMMs varies over $\{3, 5, 7, 10\}$. The original dimension is $30$. With dimension reduction, one canonical variate is used. (b) Results obtained by the vector-based algorithms SVM, RF, and LR are compared with CVW-C and CVW-S. In the legends on the right-hand side of the plot, the average performance over the four different numbers of components is shown for each algorithm, and the value inside the parentheses is the standard deviation.
  • Figure 3: Robustness of CVW when the GMM representations of training and test data are generated under different setups. (a) AUC; (b) Classification accuracy. There are 8 datasets, each containing GMMs created under a specific setup. Each dataset is used once for training, and for any given training dataset, the test data are taken from each of the 8 datasets in turn. The horizontal axis shows the training dataset ID; for each training dataset, the AUCs or accuracy levels on the 8 different test datasets are shown. When training and test samples are from the same dataset, the result is marked by a circled cross.
  • Figure 4: Classification performance on breast cancer (BC) and uveal melanoma (UM) data. Top row: BC. Bottom row: UM. (a) and (c): Results across different numbers of canonical variates for GMM-S with $\zeta=7$; (b) and (d): Comparison of SVM, RF, LR, CVW-C, and CVW-S over $\zeta\in\{3, 5, 7, 10\}$. In the legends on the right-hand side, average AUCs and accuracy levels across the values of $\zeta$ are listed with their standard deviations.
  • Figure 5: Convergence study based on the BC, UM, and PF datasets. For each dataset, Fisher's ratio of variations and the Grassmann distance between subspaces found in consecutive iterations are shown with respect to the number of iterations. (a)$\sim$(c): Orthonormal projection for datasets BC ($d=40$), UM ($d=13$), and PF ($d=1$), respectively. (d)$\sim$(f): Non-orthonormal projection for BC ($d=40$), UM ($d=13$), and UM ($d=25$). Here, $d$ denotes the number of canonical variates.
  • ...and 1 more figures

Theorems & Definitions (4)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 1