Graph Canonical Correlation Analysis

Hongju Park, Shuyang Bai, Zhenyao Ye, Hwiyoung Lee, Tianzhou Ma, Shuo Chen

TL;DR

This work introduces graph Canonical Correlation Analysis (gCCA), a sparse, graph-aware extension of classical CCA designed for high-dimensional, multi-view data. By modeling cross-dataset associations as a bipartite graph and greedily extracting dense bicliques, gCCA identifies functionally coherent variable modules and estimates canonical correlations within those modules. The authors establish finite-sample guarantees, including a minimum-sample requirement for exact recovery and a square-root rate for correlation estimation, supported by concentration and martingale-based arguments. Empirically, gCCA outperforms sparse CCA in simulations and reveals biologically meaningful methylation-transcriptomics pathways in TCGA-GBM data, including both positive and negative regulatory relationships, with publicly available code for replication.

Abstract

Canonical correlation analysis (CCA) is a widely used technique for estimating associations between two sets of multi-dimensional variables. Recent advances in CCA methods have extended their application to deciphering interactions in multiomics datasets, imaging-omics datasets, and more. However, conventional CCA methods are limited in their ability to incorporate structured patterns in the cross-correlation matrix, potentially leading to suboptimal estimates. To address this limitation, we propose the graph Canonical Correlation Analysis (gCCA) approach, which calculates canonical correlations based on the graph structure of the cross-correlation matrix between the two sets of variables. We develop computationally efficient algorithms for gCCA and provide theoretical results for the finite-sample analysis of best subset selection and canonical correlation estimation by introducing concentration inequalities and a stopping-time rule based on martingale theory. Extensive simulations demonstrate that gCCA outperforms competing CCA methods. Additionally, we apply gCCA to a multiomics dataset of DNA methylation and RNA-seq transcriptomics, identifying gene expression pathways that are positively or negatively regulated by DNA methylation pathways.
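
For readers who want a concrete reference point, the classical (non-graph) first canonical correlation that gCCA builds on can be sketched as below. This is the textbook whitening-plus-SVD computation, not the authors' gCCA code; the function name `first_canonical_correlation` and the small `ridge` stabiliser are assumptions introduced here for illustration.

```python
import numpy as np

def first_canonical_correlation(X, Y, ridge=1e-8):
    """Classical CCA: first canonical correlation and canonical vectors.

    X : (n, p) array, Y : (n, q) array, rows are paired samples.
    ridge : hypothetical small diagonal term that keeps the Cholesky
            factorisations numerically stable (an assumption, not from the paper).
    """
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / (n - 1) + ridge * np.eye(X.shape[1])
    Syy = Y.T @ Y / (n - 1) + ridge * np.eye(Y.shape[1])
    Sxy = X.T @ Y / (n - 1)
    # Cholesky factors: Sxx = Lx Lx^T and Syy = Ly Ly^T.
    Lx = np.linalg.cholesky(Sxx)
    Ly = np.linalg.cholesky(Syy)
    # Whitened cross-covariance M = Lx^{-1} Sxy Ly^{-T}.
    M = np.linalg.solve(Lx, Sxy)
    M = np.linalg.solve(Ly, M.T).T
    # Singular values of M are the canonical correlations.
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    a = np.linalg.solve(Lx.T, U[:, 0])  # canonical vector for X
    b = np.linalg.solve(Ly.T, Vt[0])    # canonical vector for Y
    return s[0], a, b
```

In a gCCA-style workflow one would presumably call such a routine only on the columns of `X` and `Y` selected by the extracted subgraph, rather than on all variables.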

Paper Structure

This paper contains 19 sections, 5 theorems, 91 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

lemma 1

Let $R_{ij}$ be the sample correlation coefficient of $X_i$ and $Y_j$ computed from a sample of size $n$, and let $\rho_{ij}$ be their ground-truth correlation. Then, for all $i \in [p]$ and $j \in [q]$, with probability at least $1-\delta$, we have

Figures (3)

  • Figure 1: Pipeline for detecting associated variables across two datasets with gCCA. Step 1 shows the correlation matrix computed from the two datasets. Step 2 illustrates the subgraphs detected by the greedy algorithm for a specific value of the tuning parameter $\lambda$. Step 3 shows the two optimal subgraphs obtained with the optimal tuning parameter, together with the connections among variables inside and outside the subgraphs. Finally, Step 4 computes the canonical vectors and the canonical correlation.
  • Figure 2: Row and column exclusion process of the greedy algorithm in the presence of a 2 by 2 subgraph in a graph with 5 rows and 4 columns. Red and gray cells represent (unknown) associated and irrelevant variables, respectively. Solid red lines indicate the exclusion of the row or column with the lowest mean among all active rows and columns. The table shows that the objective function is maximized at $t=6$; therefore $J^{1,6} = (\{1,2\}, \{1,2\}, \{e_{11},e_{12},e_{21},e_{22}\} )$, with $A_{ij}=1$ for all $(i,j)$, is taken as the extracted biclique subgraph.
  • Figure 3: Heatmaps of sample correlation matrices in the real-data analysis. The leftmost panel is the sample correlation matrix. The two middle panels are the correlation matrices reordered by gCCA (top) and sCCA (bottom). The two rightmost panels are the subgraphs extracted from the TCGA-GBM dataset by gCCA (top) and sCCA (bottom). The subgraphs extracted by gCCA and sCCA are of sizes 912 by 1793 and 100 by 100, respectively, with canonical correlations of 0.836 for gCCA and 0.743 for sCCA.
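
The row and column exclusion process described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the function name `greedy_peel`, the parameter name `lam`, and in particular the form of the objective (total weight divided by block size raised to the power `lam`) are assumptions, since the paper's exact objective is not reproduced on this page.

```python
import numpy as np

def greedy_peel(W, lam=0.5):
    """Peel off the weakest row or column until nothing is left.

    W   : (p, q) nonnegative weight matrix, e.g. absolute sample correlations.
    lam : hypothetical tuning parameter (assumed form of the objective).
    Returns the rows, columns, and score of the best block seen along the way.
    """
    rows = list(range(W.shape[0]))
    cols = list(range(W.shape[1]))
    best_score, best_rows, best_cols = -np.inf, list(rows), list(cols)
    while rows and cols:
        sub = W[np.ix_(rows, cols)]
        # Assumed objective: total weight normalised by (block size) ** lam.
        score = sub.sum() / (len(rows) * len(cols)) ** lam
        if score > best_score:
            best_score, best_rows, best_cols = score, list(rows), list(cols)
        row_means = sub.mean(axis=1)
        col_means = sub.mean(axis=0)
        # Exclude the active row or column with the lowest mean.
        if row_means.min() <= col_means.min():
            rows.pop(int(row_means.argmin()))
        else:
            cols.pop(int(col_means.argmin()))
    return best_rows, best_cols, best_score

# Toy input mirroring the caption: a 2 by 2 block inside a 5 by 4 matrix.
W = np.full((5, 4), 0.05)
W[:2, :2] = 0.9
rows, cols, score = greedy_peel(W)  # rows == [0, 1], cols == [0, 1]
```

Tracking the best block over the whole peeling path, instead of stopping at the first drop in the score, is one simple way to mimic selecting the step $t$ at which the objective is maximized.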

Theorems & Definitions (10)

  • lemma 1
  • lemma 2
  • lemma 3
  • theorem 1
  • proof
  • theorem 2
  • proof
  • proof
  • proof
  • proof