Fast and Accurate Graph Learning for Huge Data via Minipatch Ensembles

Tianyi Yao, Minjie Wang, Genevera I. Allen

TL;DR

This work develops the novel Minipatch Graph (MPGraph) estimator, which is computationally fast, embarrassingly parallelizable, and memory efficient, and which has integrated stability-based hyperparameter tuning. It also proves that, under weaker assumptions than those of the Graphical Lasso, the MPGraph estimator achieves graph selection consistency.

Abstract

Gaussian graphical models provide a powerful framework for uncovering conditional dependence relationships between sets of nodes; they have found applications in a wide variety of fields including sensor and communication networks, physics, finance, and computational biology. Often, one observes data on the nodes and the task is to learn the graph structure, or perform graphical model selection. While this is a well-studied problem with many popular techniques, there are typically three major practical challenges: i) many existing algorithms become computationally intractable in huge-data settings with tens of thousands of nodes; ii) the need for separate data-driven hyperparameter tuning considerably adds to the computational burden; iii) the statistical accuracy of selected edges often deteriorates as the dimension and/or the complexity of the underlying graph structures increase. We tackle these problems by developing the novel Minipatch Graph (MPGraph) estimator. Our approach breaks up the huge graph learning problem into many smaller problems by creating an ensemble of tiny random subsets of both the observations and the nodes, termed minipatches. We then leverage recent advances that use hard thresholding to solve the latent variable graphical model problem to consistently learn the graph on each minipatch. Our approach is computationally fast, embarrassingly parallelizable, memory efficient, and has integrated stability-based hyperparameter tuning. Additionally, we prove that under weaker assumptions than those of the Graphical Lasso, our MPGraph estimator achieves graph selection consistency. We compare our approach to state-of-the-art computational approaches for Gaussian graphical model selection including the BigQUIC algorithm, and empirically demonstrate that our approach is not only more statistically accurate but also considerably faster for huge graph learning problems.
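
The minipatch idea described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation. It assumes a simple stand-in base estimator (inverting a ridge-regularized sample covariance and hard-thresholding the partial correlations) in place of the thresholded graph estimator used in the paper. The function name `minipatch_graph` and the parameters `n_patches`, `tau`, `pi_thr`, and `ridge` are hypothetical names chosen for illustration.

```python
import numpy as np

def minipatch_graph(X, n, m, n_patches=200, tau=0.2, pi_thr=0.5, ridge=0.1, seed=0):
    """Illustrative minipatch ensemble for graph selection (a sketch, not MPGraph itself).

    X: (N, M) data matrix; n, m: observations and nodes per minipatch.
    tau: hard threshold on absolute partial correlations (assumed stand-in).
    pi_thr: threshold on edge selection frequency across minipatches.
    """
    rng = np.random.default_rng(seed)
    N, M = X.shape
    selected = np.zeros((M, M))  # times each edge was selected
    sampled = np.zeros((M, M))   # times each node pair appeared together

    for _ in range(n_patches):
        # Draw a tiny random subset of both observations and nodes.
        rows = rng.choice(N, size=n, replace=False)
        cols = rng.choice(M, size=m, replace=False)
        patch = X[np.ix_(rows, cols)]

        # Stand-in base estimator: ridge-regularized inverse covariance.
        cov = np.cov(patch, rowvar=False) + ridge * np.eye(m)
        prec = np.linalg.inv(cov)
        prec = (prec + prec.T) / 2  # enforce exact symmetry

        # Convert to partial correlations and hard-threshold them.
        d = np.sqrt(np.diag(prec))
        pcor = -prec / np.outer(d, d)
        edges = np.abs(pcor) > tau
        np.fill_diagonal(edges, False)

        # Accumulate counts at the positions of the sampled nodes.
        idx = np.ix_(cols, cols)
        sampled[idx] += 1
        selected[idx] += edges

    # Selection frequency among minipatches containing both nodes.
    freq = np.divide(selected, sampled, out=np.zeros_like(selected), where=sampled > 0)
    adj = freq >= pi_thr
    np.fill_diagonal(adj, False)
    return adj, freq
```

Because each minipatch is processed independently, the loop body could be distributed across workers, which is what makes this kind of ensemble embarrassingly parallelizable. Each iteration also only ever touches an n-by-m slice of the data, which is the source of the memory savings.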

Paper Structure

This paper contains 15 sections, 4 theorems, 24 equations, 4 figures, 4 tables, 2 algorithms.

Key Result

theorem 1

Let Assumptions 1-6 be satisfied and let $n$ grow proportionally with $N$. Then, the minipatch graph selection estimator, MPGraph, with $\lambda \asymp \sqrt{\frac{\log m}{n}}$ and $\tau \asymp \sqrt{\frac{s \log m}{n}}$, is graph selection consistent with high probability:

Figures (4)

  • Figure 1: Validation of Theoretical Results. (A) Probability of exact edge-set recovery versus total sample size $N$ for chain graph simulations with varying number of nodes $M$. Each point represents the average over $100$ trials. (B) Edge selection accuracy (F1 Score) comparisons for small-world graph simulations with fixed $N=500$ and varying $M$. Note that oracle parameter tuning is used for all methods. (C) Edge selection accuracy of MPGraph using various minipatch sizes (i.e. $m/M$) for the same simulations in (B).
  • Figure 2: Edge Selection Accuracy (F1 Score) and Computational Time from Simulation Scenarios 1-3 for a Variety of Dimensions. Parallelism is enabled for all methods whose software packages include this functionality, as indicated by the (P) after the method name. Our MPGraph method achieves the best edge selection accuracy across all four data sets while being one of the computationally fastest methods.
  • Figure 4: Performance of MPGraph for Small-World Graph Simulations with Fixed $N=500$ and Varying Dimensionality $M$. (A) Edge selection accuracy of MPGraph using various minipatch sizes (i.e. $m/M$) with oracle tuning approaches (i.e. assume total number of true edges $|E|$ is known). (B) Edge selection accuracy of MPGraph using various minipatch sizes (i.e. $m/M$) for the same simulations in (A), but with data-driven tuning approaches.
  • Figure 5: Effects of Minipatch Size. We demonstrate how edge selection accuracy and computational time of MPGraph change with different minipatch sizes in terms of $m/M$ and $m/n$, where $M$ is the total number of nodes. (A) Edge selection accuracy in terms of F1 Score (with data-driven tuning); (B) Computational time on $\log_{10}(\text{second})$ scale. We see that our method has stable edge selection accuracy for a sensible range of $n$ and $m$ values.
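
The figure captions report edge selection accuracy as an F1 Score. As a minimal sketch, assuming the standard definition (harmonic mean of precision and recall over undirected edges), it could be computed from a true and an estimated adjacency matrix as follows. The function name `edge_f1` is hypothetical, and the paper's exact evaluation code may differ.

```python
import numpy as np

def edge_f1(adj_true, adj_est):
    """F1 score for undirected edge selection, counting each edge once."""
    adj_true = np.asarray(adj_true, dtype=bool)
    adj_est = np.asarray(adj_est, dtype=bool)

    # Use only the strict upper triangle so each undirected edge counts once.
    iu = np.triu_indices_from(adj_true, k=1)
    t, e = adj_true[iu], adj_est[iu]

    tp = np.sum(t & e)    # true edges that were selected
    fp = np.sum(~t & e)   # selected edges that are not true edges
    fn = np.sum(t & ~e)   # true edges that were missed

    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom > 0 else 0.0
```

For example, on a 4-node chain graph with true edges (0,1), (1,2), (2,3), an estimate that selects (0,1), (1,2), (0,3) has two true positives, one false positive, and one false negative, giving an F1 Score of 2/3.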

Theorems & Definitions (4)

  • theorem 1
  • lemma 1
  • lemma 2
  • theorem 1