Graph Integration for Diffusion-Based Manifold Alignment

Jake S. Rhodes, Adam G. Rustad

TL;DR

This paper compares SPUD and MASH with existing semi-supervised manifold alignment methods, shows that they outperform competing methods in aligning true correspondences and in cross-domain classification, and demonstrates how they can be applied to transfer label information between domains.

Abstract

Data from individual observations can originate from various sources or modalities but are often intrinsically linked. Multimodal data integration can enrich information content compared to single-source data. Manifold alignment is a form of data integration that seeks a shared, underlying low-dimensional representation of multiple data sources that emphasizes similarities between alternative representations of the same entities. Semi-supervised manifold alignment relies on partially known correspondences between domains, either through shared features or through other known associations. In this paper, we introduce two semi-supervised manifold alignment methods. The first method, Shortest Paths on the Union of Domains (SPUD), forms a unified graph structure using known correspondences to establish graph edges. By learning inter-domain geodesic distances, SPUD creates a global, multi-domain structure. The second method, MASH (Manifold Alignment via Stochastic Hopping), learns local geometry within each domain and forms a joint diffusion operator using known correspondences to iteratively learn new inter-domain correspondences through a random-walk approach. Through the diffusion process, MASH forms a coupling matrix that links heterogeneous domains into a unified structure. We compare SPUD and MASH with existing semi-supervised manifold alignment methods and show that they outperform competing methods in aligning true correspondences and cross-domain classification. In addition, we show how these methods can be applied to transfer label information between domains.
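The SPUD construction described in the abstract (join the domains through known correspondences, then take shortest paths) can be illustrated with a minimal sketch. This is not the authors' implementation: the k-NN graph construction, the small positive anchor edge weight, the function names `knn_graph` and `spud_like_distances`, and the toy data are all assumptions made for illustration.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from scipy.spatial.distance import cdist


def knn_graph(X, k):
    """Symmetric k-NN graph as a dense matrix; 0 means 'no edge'."""
    D = cdist(X, X)
    n = len(X)
    # Assuming no duplicate points, column 0 of the argsort is the point
    # itself, so columns 1..k hold its k nearest neighbours.
    nbrs = np.argsort(D, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    W = np.zeros((n, n))
    W[rows, nbrs.ravel()] = D[rows, nbrs.ravel()]
    return np.maximum(W, W.T)


def spud_like_distances(XA, XB, anchors, k=5, anchor_weight=1e-6):
    """Cross-domain geodesic distances on the union of two k-NN graphs.

    anchors: list of (i, j) pairs meaning A_i corresponds to B_j.
    anchor_weight is a hypothetical choice: it must be positive because
    a zero entry in a dense csgraph matrix means 'no edge'.
    """
    nA, nB = len(XA), len(XB)
    G = np.zeros((nA + nB, nA + nB))
    G[:nA, :nA] = knn_graph(XA, k)
    G[nA:, nA:] = knn_graph(XB, k)
    for i, j in anchors:
        G[i, nA + j] = G[nA + j, i] = anchor_weight
    D = shortest_path(G, method="D", directed=False)
    # Rows index domain A, columns index domain B.
    return D[:nA, nA:]


# Toy example: the same 1-D parameter observed through two feature maps.
t = np.linspace(0.0, 1.0, 20)
XA = t[:, None]
XB = 3.0 * np.column_stack([np.cos(t), np.sin(t)])
anchors = [(0, 0), (10, 10), (19, 19)]
D_cross = spud_like_distances(XA, XB, anchors, k=2)
```

In this sketch every cross-domain path must pass through an anchor edge, so the quality of the resulting distances depends on how many correspondences are known and where they fall. The two domains are left on their original scales here; rescaling each within-domain graph before joining them is a further design choice the sketch does not make.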

Paper Structure

This paper contains 13 sections, 4 equations, 4 figures.

Figures (4)

  • Figure 1: The MDS embedding of the MASH alignment of the Seeds dataset using the potential distance. The two domains are split based on the skewed split described in Section \ref{sec:splits}. The domain consisting of meaningful features is denoted by triangles and that of less relevant features by circles. The three colors denote the classes present in the dataset. Known anchor points, 5% of the data, are colored black with gray lines connecting the corresponding anchor points. Shorter lines mean better alignment. Most of the randomly determined anchor points belong to the blue or green classes, suggesting a reason for the divergence of the orange class branch. The embedding CE score is 88% and its FOSCTTM score is 11%.
  • Figure 2: The MDS embedding of the MASH alignment of the breast cancer dataset, where the four features with the greatest classification importance are given to Domain A and the other five are given to Domain B. Domain A is denoted by triangles and Domain B by circles. Blue and orange denote the two classes present in the dataset: whether or not the patient has breast cancer. In this example, we use the labeled information in Domain A (which has 69 data points) to predict labels across Domain B (which has 699 data points). The transferred labels reach an accuracy of 97%, whereas the baseline test score is only 93%.
  • Figure 3: Each of the methods is compared across each split type. Our methods are denoted by dots. The methods are scored according to their average combined score (CE - FOSCTTM). The results are aggregated across all 29 datasets and 10 repetitions. A grid of parameters was searched for each method, and the best set of parameters per method, dataset, split, and random state was used and recorded; each dataset was tested with a randomized selection of 20% of the points as anchors. Each feature-based split is dominated by SPUD, followed by MASH, MASH-, and MAGAN. On the distortion adaptation, DTA outperforms the other methods, followed by JLMA and our methods. MASH- is the best-performing method for the rotated data, followed by DTA, MAPA, and MASH.
  • Figure 4: Here we present the combined metric score aggregated at each percentage of known correspondences. We split the figure into two parts: (Top) feature-level splits and (Bottom) distortion adaptations. The feature-level splits are dominated by SPUD across all levels. MASH- outperforms other methods at all levels at or above 10% on distorted data. Our methods are more reliant on known correspondences than other methods on the distortion splits, but do comparably well or better when at least 10% of correspondences are known. We note that MAGAN is likely overfitting the correspondences (similar to mode collapse), leading to worse alignments at higher percentages.
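
The captions above report FOSCTTM, CE, and the combined score (CE - FOSCTTM). A rough sketch of how such scores could be computed from a cross-domain distance matrix follows. It assumes the common definition of FOSCTTM (fraction of samples closer than the true match, averaged over both directions, with A_i matched to B_i) and uses a 1-nearest-neighbour accuracy as a stand-in for CE; the paper's exact definitions and classifier may differ, and the function names and toy matrices are hypothetical.

```python
import numpy as np


def foscttm(D):
    """D[i, j] is the distance from A_i to B_j; A_i's true match is B_i."""
    n = D.shape[0]
    true = np.diag(D)
    # For each A_i, fraction of B points strictly closer than its match.
    frac_a = (D < true[:, None]).sum(axis=1) / (n - 1)
    # For each B_j, fraction of A points strictly closer than its match.
    frac_b = (D < true[None, :]).sum(axis=0) / (n - 1)
    return 0.5 * (frac_a.mean() + frac_b.mean())


def cross_domain_1nn_accuracy(D, labels_a, labels_b):
    """Label each B_j with the label of its nearest A_i (stand-in for CE)."""
    nearest_a = D.argmin(axis=0)
    return float((labels_a[nearest_a] == labels_b).mean())


def combined_score(D, labels_a, labels_b):
    """CE - FOSCTTM, following the combined score named in Figure 3."""
    return cross_domain_1nn_accuracy(D, labels_a, labels_b) - foscttm(D)


# Toy matrices: a well-aligned case and a reversed (badly aligned) case.
idx = np.arange(6)
D_good = np.abs(idx[:, None] - idx[None, :]).astype(float)
D_bad = D_good[::-1]
labels = np.array([0, 0, 0, 1, 1, 1])

score_good = combined_score(D_good, labels, labels)
score_bad = combined_score(D_bad, labels, labels)
```

Lower FOSCTTM and higher CE both indicate better alignment, so subtracting one from the other gives a single number where larger is better.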