Making Multi-Axis Gaussian Graphical Models Scalable to Millions of Samples and Features

Bailey Andrew; David R. Westhead; Luisa Cutillo

TL;DR

This work tackles scalable inference of conditional dependencies in data that do not satisfy sample independence, focusing on multi-axis, tensor-variate datasets. It develops a scalable, independence-free Gaussian graphical model based on a singular Kronecker-sum normal distribution with low-rank axis graphs, achieving $O(n^2)$ time and $O(n)$ space. The approach preserves multi-modality and arbitrary marginals via a Gaussian copula, provides interpretable hyperparameters, and enables edge-wise hypothesis testing using the Fisher information. Demonstrations on synthetic data and large real-world datasets, including a million-cell scRNA-seq PBMC dataset, illustrate significant scalability gains and competitive accuracy relative to prior methods.
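To make the Kronecker-sum structure concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the names `kronecker_sum`, `psi_rows`, and `psi_cols` are hypothetical, and the matrices are small random symmetric stand-ins for per-axis graphs. It checks the standard identity that the eigenvalues of a Kronecker sum are pairwise sums of the factors' eigenvalues, and that its eigenvectors are Kronecker products of the factors' eigenvectors. That identity is why per-axis eigendecompositions can stand in for one large decomposition.

```python
import numpy as np


def kronecker_sum(a, b):
    """Dense Kronecker sum: a ⊗ I + I ⊗ b (for illustration only)."""
    return np.kron(a, np.eye(b.shape[0])) + np.kron(np.eye(a.shape[0]), b)


def random_symmetric(d, rng):
    """Random symmetric matrix used as a stand-in for an axis-wise graph."""
    m = rng.standard_normal((d, d))
    return (m + m.T) / 2


rng = np.random.default_rng(0)
psi_rows = random_symmetric(3, rng)  # hypothetical row-axis matrix
psi_cols = random_symmetric(4, rng)  # hypothetical column-axis matrix

# Eigendecompose each small factor separately.
lam_rows, vec_rows = np.linalg.eigh(psi_rows)
lam_cols, vec_cols = np.linalg.eigh(psi_cols)

# Eigenvalues of the Kronecker sum are all pairwise sums lam_rows[i] + lam_cols[j];
# the matching eigenvectors are the Kronecker products vec_rows[:, i] ⊗ vec_cols[:, j].
lam_sum = (lam_rows[:, None] + lam_cols[None, :]).ravel()
vec_sum = np.kron(vec_rows, vec_cols)

# Recompose the full matrix from the factor-wise decompositions and compare.
dense = kronecker_sum(psi_rows, psi_cols)
rebuilt = (vec_sum * lam_sum) @ vec_sum.T
print(np.allclose(dense, rebuilt))
```

The dense matrices here exist only to verify the identity; at scale, the point of working with the factors is to avoid materialising the full product-sized matrix at all.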

Abstract

Gaussian graphical models can be used to extract conditional dependencies between the features of a dataset. This is often done by assuming that the samples are independent, but this assumption is rarely satisfied in reality. However, state-of-the-art approaches that avoid this assumption are not scalable, with $O(n^3)$ runtime and $O(n^2)$ space complexity. In this paper, we introduce a method that has $O(n^2)$ runtime and $O(n)$ space complexity, without assuming independence. We validate our model on both synthetic and real-world datasets, showing that our method's accuracy is comparable to that of prior work. We demonstrate that our approach can be used on unprecedentedly large datasets, such as a real-world 1,000,000-cell scRNA-seq dataset; this was impossible with previous approaches. Our method maintains the flexibility of prior work, such as the ability to handle multi-modal tensor-variate datasets and the ability to work with data of arbitrary marginal distributions. An additional advantage of our method is that, unlike prior work, our hyperparameters are easily interpretable.
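The sketch below illustrates one common way to handle arbitrary marginal distributions in a Gaussian copula setting: a rank-based normal-scores transform. This is an assumption made for illustration; the text here does not specify the exact transform the authors use, and the function name `normal_scores` is hypothetical. Each column is replaced by the standard-normal quantiles of its ranks, so only the ordering of the values matters, not their original scale or shape.

```python
from statistics import NormalDist

import numpy as np


def normal_scores(x):
    """Map each column of x to standard-normal quantiles of its ranks.

    Assumes continuous data with no ties (ties are broken arbitrarily).
    """
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    # Double argsort yields 0-based ranks within each column; shift to 1..n.
    ranks = np.argsort(np.argsort(x, axis=0), axis=0) + 1
    # Divide by n + 1 so the values lie strictly inside (0, 1).
    u = ranks / (n + 1)
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(u)


rng = np.random.default_rng(0)
skewed = rng.exponential(size=(500, 3))  # heavily non-Gaussian marginals
z = normal_scores(skewed)
print(z.shape, np.abs(z.mean(axis=0)).max() < 1e-6)
```

Because the transform is monotone within each column, it preserves the rank structure of the data while giving every column an approximately standard-normal marginal.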

Paper Structure (21 sections, 28 theorems, 69 equations, 8 figures, 6 tables, 1 algorithm)

Introduction
Related Work
Our Contributions
Notation
Methods
Assumptions and Their Justification
Derivation of the Method
Identifiability
Convexity, Existence, and Uniqueness
Hypothesis Testing
Practical Implementation
Picking the Hyperparameters
Results
Synthetic Data
Real Data
...and 6 more sections

Key Result

Lemma 1

Figures (8)

Figure 1: Dependencies between (gene, cell) pairs. In the same cell, two genes are connected only if the genes are related to each other. An analogous statement holds for cell-cell relations. Cross terms, such as (A, X) $\sim$ (B, Y), may exist, but they are indirect effects. Conditioning out the mediators, in this case (B, X) and (A, Y), will remove the relation.
Figure 2: Precision-recall curves on graphs of various sizes. On the right, we report how much variance the used eigenvalues account for. For GmGM and TeraLasso, this will always be 1, as they do not make a low-rank assumption. The shaded region around each PR curve corresponds to the maximum and minimum values reported over 10 runs; the central curve is the mean of the maximum and minimum.
Figure 3: Runtimes of various algorithms as the number of nodes in the graph increases. In the top-left, we zoom in on the sub-1000-node region to show TeraLasso and DNNLasso. GmGM-50pcs-minimal corresponds to the case in which we assume the number of edges in the graph is the same as the number of nodes; for the other models, we kept the full graph.
Figure 4: The GmGM algorithm can be split into three parts: calculating the eigenvectors, calculating the eigenvalues, and then combining the two ('recomposition') to produce the final precision matrix. This plot shows the relative amount of time spent in each of the three parts.
Figure 5: A comparison of several methods to find the frame graph.
...and 3 more figures

Theorems & Definitions (28)

Lemma [dahl_network_2013, Lemma 1]
Lemma [andrew_gmgm_2024]: Cyclic Property
Lemma 1: Extraction Property
Lemma 2: Downsampling Property
Lemma 3
Theorem 4
Corollary 5
Corollary 6
Theorem 7
Corollary 8
...and 18 more
