Kernel spectral joint embeddings for high-dimensional noisy datasets using duo-landmark integral operators

Xiucai Ding, Rong Ma

TL;DR

A kernel spectral method is proposed that achieves joint embeddings of two independently observed high-dimensional noisy datasets; the embeddings are shown to converge to the eigenfunctions of some natural integral operators.

Abstract

Integrative analysis of multiple heterogeneous datasets has become standard practice in many research fields, especially in single-cell genomics and medical informatics. Existing approaches often suffer from limited power to capture nonlinear structures, insufficient accounting for noise and the effects of high dimensionality, and a lack of adaptivity to imbalances in signal strength and sample size; their results are also sometimes difficult to interpret. To address these limitations, we propose a novel kernel spectral method that achieves joint embeddings of two independently observed high-dimensional noisy datasets. The proposed method automatically captures and leverages possibly shared low-dimensional structures across datasets to enhance embedding quality. The obtained low-dimensional embeddings can be used for many downstream tasks such as simultaneous clustering, data visualization, and denoising. The proposed method is justified by rigorous theoretical analysis. Specifically, we show the consistency of our method in recovering the low-dimensional noiseless signals, and characterize the effects of the signal-to-noise ratios on the rates of convergence. Under a joint manifolds model framework, we establish the convergence of the final embeddings to the eigenfunctions of some newly introduced integral operators. These operators, referred to as duo-landmark integral operators, are defined by the convolutional kernel maps of some reproducing kernel Hilbert spaces (RKHSs). These RKHSs capture the partially or entirely shared underlying low-dimensional nonlinear signal structures of the two datasets. Our numerical experiments and analyses of two single-cell omics datasets demonstrate the empirical advantages of the proposed method over existing methods in both embeddings and several downstream tasks.
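
As a rough illustration of the idea in the abstract, the following Python sketch builds a rectangular Gaussian kernel between two datasets that share the same features and uses its singular vectors as joint embeddings. It is a minimal sketch under stated assumptions, not the authors' algorithm: the function names `cross_gaussian_kernel` and `joint_embedding`, the median bandwidth heuristic, the singular-value scaling, and the toy circle data are all hypothetical choices made here.

```python
import numpy as np


def cross_gaussian_kernel(X, Y, bandwidth=None):
    """Rectangular Gaussian kernel between rows of X (n1 x p) and Y (n2 x p)."""
    # Squared Euclidean distances between every row of X and every row of Y.
    sq = (X ** 2).sum(axis=1)[:, None] + (Y ** 2).sum(axis=1)[None, :] - 2.0 * X @ Y.T
    sq = np.maximum(sq, 0.0)  # guard against tiny negative values from rounding
    if bandwidth is None:
        # Assumption: median heuristic; the paper's bandwidth rule may differ.
        bandwidth = np.median(sq)
    return np.exp(-sq / bandwidth)


def joint_embedding(X, Y, r=2, bandwidth=None):
    """Embed both datasets via the leading r singular vectors of the cross-kernel."""
    K = cross_gaussian_kernel(X, Y, bandwidth)
    U, s, Vt = np.linalg.svd(K, full_matrices=False)
    emb_x = U[:, :r] * s[:r]     # one row per sample in X
    emb_y = Vt[:r].T * s[:r]     # one row per sample in Y
    return emb_x, emb_y, s


# Toy data (hypothetical): two noisy samples of the same circle in p = 50 dimensions.
rng = np.random.default_rng(0)
p = 50


def noisy_circle(n, noise):
    theta = rng.uniform(0.0, 2.0 * np.pi, n)
    signal = np.zeros((n, p))
    signal[:, 0] = 3.0 * np.cos(theta)
    signal[:, 1] = 3.0 * np.sin(theta)
    return signal + noise * rng.normal(size=(n, p))


X = noisy_circle(80, 0.1)    # cleaner dataset
Y = noisy_circle(120, 0.3)   # noisier dataset with a different sample size
emb_x, emb_y, s = joint_embedding(X, Y, r=2)
print(emb_x.shape, emb_y.shape)
```

The two datasets may have different sample sizes because the kernel matrix is rectangular; the left singular vectors index samples of the first dataset and the right singular vectors index samples of the second.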

Paper Structure

This paper contains 42 sections, 19 theorems, 193 equations, 16 figures, and 1 algorithm.

Key Result

Proposition 3.4

Suppose Assumptions assum_signal and assum_commonstructureassumption hold. Recall (eq_reducedmapping). If the kernels in Definition defn_clmd are properly defined in the sense that $d'>0$, then we can rewrite [equation not captured in this extract]. Moreover, the above kernels are bounded and positive definite.
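
The boundedness and positive-definiteness claim can be illustrated numerically on a loose analogue. The sketch below assumes a kernel formed by averaging products of a Gaussian cross-kernel over a second set of points; this construction, the variable names, and the random data are assumptions made here for illustration and are not the paper's Definition 3.3. For such a matrix, entries lie in (0, 1] and all eigenvalues are nonnegative up to rounding.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))   # hypothetical points where the kernel is evaluated
Y = rng.normal(size=(40, 5))   # hypothetical landmark points from the other dataset

# Gaussian cross-kernel between X and Y (median bandwidth is an assumption).
sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-sq / np.median(sq))

# Averaged product kernel: N[i, j] = mean over landmarks of K[i, l] * K[j, l].
N = K @ K.T / Y.shape[0]

eigvals = np.linalg.eigvalsh(N)
print("smallest eigenvalue:", eigvals.min())
print("entry range:", N.min(), N.max())
```

Because every entry of `K` lies in (0, 1], each entry of `N` is an average of numbers in (0, 1], and `N` has the form of a matrix times its transpose, which is why it cannot have negative eigenvalues.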

Figures (16)

  • Figure 1: Illustration of multiple independently observed datasets with potentially shared information. Each dataset contains the same features but a possibly different number of samples.
  • Figure 2: Illustration of the joint manifolds model. Here $\iota_1(\mathcal{M}_1)$ and $\iota_2(\mathcal{M}_2)$ contain partially overlapped or identical structures. This generalizes the common manifold model considered in ding2021kernel, talmon2019latent.
  • Figure C: Comparison of simultaneous clustering performance of 7 different approaches using the Rand index. The parameter $\tau$ indicates the strength of the added structural discrepancy between the two datasets. Left: simulation Setting 1 with identical cluster structures. Right: simulation Setting 2 with partially overlapping clusters. The proposed method performs best at leveraging the (even partially) shared cluster patterns across datasets to improve clustering of each individual dataset.
  • Figure D: Comparison of nonlinear manifold learning performance of 7 different approaches. Left: concordance measures under various sample sizes. Right: running time (in minutes) of the three related algorithms "prop", "rl", and "lbdm", which achieved relatively better performance, showing the competitive scalability of "prop". The proposed algorithm has the best overall performance in retrieving the torus structures from the noisy dataset $\{\mathbf{y}_i\}$. Our results suggest the advantages of integrative embedding methods (e.g., "prop") over non-integrative embedding methods ("pca" and "kpca"), which come from leveraging the shared information in the external, cleaner dataset $\{\bm{x}_i\}$.
  • Figure E: Comparison of eight methods for simultaneous biclustering of single-cell datasets. Each boxplot contains the Rand index for clustering accuracy obtained under various embedding dimensions ($\mathsf r$ from 3 to 20). Left: single-cell RNA-seq data for human peripheral blood mononuclear cells (kang2018multiplexed). Right: single-cell ATAC-seq gene activity data for mouse brain cells (luecken2022benchmarking). The proposed method not only achieves better clustering accuracy compared with other methods, but also shows smaller variability and therefore robustness with respect to different choices of embedding dimensions.
  • ...and 11 more figures
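
Several of the captions above report clustering accuracy with the Rand index. As a reminder of what that score measures, here is a minimal Python sketch of the plain (unadjusted) Rand index computed by pair counting; the function name `rand_index` and the toy labels are illustrative, and whether the paper uses exactly this variant is an assumption.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree.

    A pair counts as an agreement when both clusterings put the two samples
    in the same cluster, or both put them in different clusters.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label sequences must have the same length")
    if len(labels_a) < 2:
        raise ValueError("need at least two samples to form a pair")
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += int(same_a == same_b)
        total += 1
    return agree / total


# Relabelling clusters does not change the score.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Two clusterings that cut the samples in different ways.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, crossed)
```

The score depends only on which samples are grouped together, not on the label values, so it can compare a method's clusters with ground-truth cell types directly.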

Theorems & Definitions (51)

  • Definition 1.1: Stochastic domination
  • Definition 3.3: Convolutional landmark kernels
  • Proposition 3.4
  • Definition 3.5: Duo-landmark integral operators
  • Proposition 3.6
  • Remark 3.7
  • Theorem 3.8
  • Remark 3.11
  • Theorem 3.12
  • Corollary 3.13
  • ...and 41 more