Asymmetric canonical correlation analysis of Riemannian and high-dimensional data

James Buenfil; Eardi Lila

TL;DR

The paper employs a reformulation of canonical correlation analysis that enables efficient control of the complexity of the functional canonical directions via tangent space sieve approximations, and that enforces an interpretable group structure on the high-dimensional canonical directions via a sparsity-promoting penalty.

Abstract

In this paper, we introduce a novel statistical model for the integrative analysis of Riemannian-valued functional data and high-dimensional data. We apply this model to explore the dependence structure between each subject's dynamic functional connectivity -- represented by a temporally indexed collection of positive definite covariance matrices -- and high-dimensional data representing lifestyle, demographic, and psychometric measures. Specifically, we employ a reformulation of canonical correlation analysis that enables efficient control of the complexity of the functional canonical directions using tangent space sieve approximations. Additionally, we enforce an interpretable group structure on the high-dimensional canonical directions via a sparsity-promoting penalty. The proposed method shows improved empirical performance over alternative approaches and comes with theoretical guarantees. Its application to data from the Human Connectome Project reveals a dominant mode of covariation between dynamic functional connectivity and lifestyle, demographic, and psychometric measures. This mode aligns with results from static connectivity studies but reveals a unique temporal non-stationary pattern that such studies fail to capture.
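
The tangent space representation of positive definite matrices mentioned in the abstract can be illustrated with a minimal sketch. It assumes the affine-invariant metric on SPD matrices, a common choice that this summary does not confirm as the one used in the paper. The function names (`spd_log`, `spd_exp`, `spd_inner`) are hypothetical and are not the authors' code.

```python
import numpy as np

def _sym_fun(S, fun):
    """Apply a scalar function to the eigenvalues of a symmetric matrix S."""
    S = (S + S.T) / 2.0          # guard against round-off asymmetry
    w, V = np.linalg.eigh(S)
    return (V * fun(w)) @ V.T    # V diag(fun(w)) V^T

def _roots(mu):
    """Return mu^{1/2} and mu^{-1/2} for an SPD matrix mu."""
    return _sym_fun(mu, np.sqrt), _sym_fun(mu, lambda w: 1.0 / np.sqrt(w))

def spd_log(mu, Y):
    """Log map at mu: sends the SPD matrix Y to a symmetric tangent matrix.
    Assumption: affine-invariant metric."""
    h, ih = _roots(mu)
    return h @ _sym_fun(ih @ Y @ ih, np.log) @ h

def spd_exp(mu, T):
    """Exp map at mu: sends a symmetric tangent matrix T back to an SPD matrix."""
    h, ih = _roots(mu)
    return h @ _sym_fun(ih @ T @ ih, np.exp) @ h

def spd_inner(mu, A, B):
    """Tangent-space inner product at mu: trace(mu^{-1} A mu^{-1} B)."""
    mu_inv = np.linalg.inv(mu)
    return float(np.trace(mu_inv @ A @ mu_inv @ B))
```

For temporally indexed data, one would apply `spd_log` at each time point and, under a further assumption about the functional inner product, average or integrate `spd_inner` over the time grid to obtain a scalar projection.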

Paper Structure

This paper contains 53 sections, 47 theorems, 316 equations, 4 figures, 1 table, 2 algorithms.

Introduction
Model
Elements of Riemannian geometry
Modeling Riemannian-valued data
Asymmetric Riemannian CCA
On the existence of canonical directions and connections with partial least-squares
Estimation
Selection of hyperparameters
Special instances
Theory
Estimation error rates for Asymmetric Sparse CCA
Estimation error rates for canonical directions from Asymmetric Sparse-Functional CCA
Estimation error rates for canonical variables from Asymmetric Sparse-Functional CCA
Application to dynamic functional connectivity
Data and preprocessing
...and 38 more sections

Key Result

Theorem 2.1

Under Assumption remark:finite_dim_assumption, the CCA model in equation (eq:cca_intuition) admits at most $d^{(\text{corr})}$ nontrivial canonical variable pairs $\{(U_k,V_k)\}$, and each pair $(U_k,V_k)$ can be written in terms of the associated canonical directions: $U_k = \llangle \cdots$ [...] and let [...] be an eigendecomposition of $B^{\top}\Sigma_X B$. Define [...]. Then, the $k$th column of $H$, [...]

Figures (4)

Figure 1: In this figure, we illustrate the process of projecting the Riemannian-valued functional data and the high-dimensional data to define maximally correlated variables. We leverage tools from differential geometry to compute linear tangent representations $\text{Log}_{\mu} y$ of the temporally-indexed Riemannian-valued data $y$, which are equipped with a notion of inner product $\llangle \cdot, \cdot \rrangle_\mu$, that is, a projection operator. For the multivariate data, we use the conventional notion of projection, i.e., the Euclidean inner product. We therefore seek $\psi$ and $\theta$ whose respective data projections define maximally correlated variables.
Figure 2: This figure illustrates the first mode of covariation between dynamic connectivity and behavioral measures. On the top panel, we show $\left(\operatorname{Exp}_{\hat{\mu}}\left( -c \hat{\psi}_1 \right), -c \hat{\theta}_1 \right)$, which we refer to as 'First CCA Mode +'; on the bottom panel, we show $\left(\operatorname{Exp}_{\hat{\mu}}\left( +c \hat{\psi}_1 \right), +c \hat{\theta}_1 \right)$, which we refer to as 'First CCA Mode -'. These represent the two extremes of the spectrum identified by the first mode of covariation. Within each panel, we show the canonical function of SPD covariances $\operatorname{Exp}_{\hat{\mu}}\left( \pm \hat{\psi}_1 \right)$ at three different times, and a subset of the selected entries of the canonical vector $\pm \hat{\theta}_1$. The depicted mode of covariation suggests that increasing variance over time within the visual (VIS) and default mode (DFM) functional systems, as well as increasing covariance between these systems, correlates positively with higher scores in 'ProcSpeed_AgeAdj' (assessing processing speed) and 'PicVocab_AgeAdj' (evaluating language/vocabulary comprehension), and correlates negatively with cannabis and opiate use (variables THC, SSAGA_Mj_Use, and SSAGA_Times_Used_Opiates).
Figure 3: On the left panel, for both 'First CCA Mode -' and 'First CCA Mode +', we show the temporal dynamics of selected entries of the dynamic mode of connectivity shown in Figure 2. Notably, some of these, e.g., the DFM-PCC covariance, remain stationary for both 'First CCA Mode +' and 'First CCA Mode -', while others, e.g., the DFM-VIS covariance, have markedly different patterns. On the right panel, we show the complete list of the 39 entries of the canonical vector $\pm \hat{\theta}_1$ selected by the proposed model out of an initial set of 150 variables, along with their relative importance.
Figure 4: (Top left): Performance evaluation using metric A, which measures the normalized Euclidean error in the first high-dimensional canonical vector, on approaches 1-3. (Top right): Performance evaluation using metric C, which is the parallel transport error in the first canonical function, on approaches 1-3. (Bottom left): Performance evaluation using metric B, the F1-score of the first estimated high-dimensional canonical vector compared to the associated population vector, on approaches 1-3. (Bottom right): Performance evaluation using out-of-sample correlations. We use out-of-sample tangent correlation (metric D) for approaches 1-3, and out-of-sample Euclidean correlation (metric E) for approach 4.
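
To make metrics A and B from Figure 4 concrete, here is a minimal sketch of how such quantities might be computed. It is not the authors' implementation: the unit-norm scaling, the minimization over the sign flip (canonical vectors are only identified up to sign), and the definition of the support via a tolerance `tol` are assumptions, and the function names are hypothetical. Both vectors are assumed to be nonzero.

```python
import math

def _unit(v):
    """Scale a nonzero vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def normalized_euclidean_error(theta_hat, theta):
    """Metric-A-style error: Euclidean distance between unit-normalized
    vectors, minimized over the sign of the estimate (assumed convention)."""
    a, b = _unit(theta_hat), _unit(theta)
    d_plus = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    d_minus = math.sqrt(sum((x + y) ** 2 for x, y in zip(a, b)))
    return min(d_plus, d_minus)

def support_f1(theta_hat, theta, tol=0.0):
    """Metric-B-style score: F1 of the estimated support (entries with
    absolute value above tol) against the population support."""
    est = {i for i, x in enumerate(theta_hat) if abs(x) > tol}
    true = {i for i, x in enumerate(theta) if abs(x) > tol}
    tp = len(est & true)
    if tp == 0:
        return 0.0
    precision = tp / len(est)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)
```

An estimate that recovers the population vector up to sign and scale has error close to zero and an F1-score of one, which is the behavior a sparse canonical vector estimator should approach as the sample size grows.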

Theorems & Definitions (63)

Theorem 2.1
Theorem 4.1
Theorem 4.2
Theorem 4.3
Theorem B.1
Remark 1
Lemma B.1
Remark 2
Remark 3
Lemma B.2
...and 53 more
