Regularised Canonical Correlation Analysis: graphical lasso, biplots and beyond
Lennie Wells, Kumar Thurimella, Sergio Bacallado
TL;DR
The paper addresses the challenge of regularising Canonical Correlation Analysis in high-dimensional settings by proposing Graphical CCA (gCCA), which leverages the Graphical Lasso to estimate a sparse precision matrix encoding conditional independencies between two data views. The authors provide a rigorous theoretical framework linking precision-matrix estimation to accurate canonical subspaces via Lipschitz plug-in arguments and perturbation theory, and they position gCCA relative to ridge and sparse CCA methods. A comprehensive evaluation framework combines oracle and empirical criteria for correlation capture and estimation accuracy, with cross-validation guiding parameter selection and a subspace- and biplot-centric interpretation toolkit. The real-data application on a microbiome dataset demonstrates the practical utility of the framework, revealing that many successive directions carry meaningful signal and that variates are more stable than weights, with biplots and overlap matrices providing intuitive visual diagnostics. Overall, the work offers a principled, interpretable approach to high-dimensional CCA, delivering both theoretical guarantees and practical tools to support exploratory data analysis and wider adoption in complex biological datasets.
Abstract
Recent developments in regularized Canonical Correlation Analysis (CCA) promise powerful methods for high-dimensional, multiview data analysis. However, justifying the structural assumptions behind many popular approaches remains a challenge, and features of realistic biological datasets pose practical difficulties that are seldom discussed. We propose a novel CCA estimator rooted in an assumption of conditional independencies and based on the Graphical Lasso. Our method has desirable theoretical guarantees and good empirical performance, which we demonstrate through extensive simulations and on real-world biological datasets. Recognizing the difficulties of model selection in high dimensions and other practical challenges of applying CCA in real-world settings, we introduce a novel framework for evaluating and interpreting regularized CCA models in the context of Exploratory Data Analysis (EDA), which we hope will empower researchers and pave the way for wider adoption.
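
To make the plug-in idea concrete, here is a minimal sketch and not the authors' implementation. It assumes an estimate of the joint precision matrix of the two stacked views is already available, for instance from a Graphical Lasso fit such as scikit-learn's `GraphicalLasso`. It then recovers canonical weights and correlations from the implied covariance blocks. The function names `inv_sqrt` and `cca_from_precision` and the toy covariance are hypothetical, chosen only for illustration.

```python
import numpy as np


def inv_sqrt(S):
    """Inverse symmetric square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V / np.sqrt(w)) @ V.T


def cca_from_precision(omega, p, k):
    """Plug-in CCA from a joint precision matrix (hypothetical helper).

    omega : (p+q, p+q) precision matrix of the stacked views (X, Y)
    p     : number of variables in the first view X
    k     : number of canonical directions to return
    """
    sigma = np.linalg.inv(omega)              # implied joint covariance
    sxx, sxy, syy = sigma[:p, :p], sigma[:p, p:], sigma[p:, p:]
    wx, wy = inv_sqrt(sxx), inv_sqrt(syy)     # whitening transforms
    U, rho, Vt = np.linalg.svd(wx @ sxy @ wy)
    A = wx @ U[:, :k]                         # canonical weights for X
    B = wy @ Vt[:k].T                         # canonical weights for Y
    return A, B, rho[:k]


# Toy population covariance (an assumption for illustration):
# identity within each view, cross-covariance diag(0.8, 0.3).
sxy_true = np.diag([0.8, 0.3])
sigma = np.block([[np.eye(2), sxy_true], [sxy_true.T, np.eye(2)]])
omega = np.linalg.inv(sigma)  # in practice, replaced by a sparse estimate

A, B, rho = cca_from_precision(omega, p=2, k=2)
print(rho)  # canonical correlations, here approximately [0.8, 0.3]
```

In practice `omega` would be the sparse estimate produced by the Graphical Lasso, with its penalty chosen by cross-validation as the TL;DR describes. The exact inverse of a known covariance is used here only so the output can be checked by hand.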
