Table of Contents
Fetching ...

Conjugate gradient methods for high-dimensional GLMMs

Andrea Pandolfi, Omiros Papaspiliopoulos, Giacomo Zanella

TL;DR

This work addresses the computational bottleneck in fitting high-dimensional GLMMs, where posterior precision ${\boldsymbol{Q}}$ combines a diagonal prior with a dense likelihood contribution ${\boldsymbol{V}}^{T} {\boldsymbol{\Omega}} {\boldsymbol{V}}$. It shows that exact Cholesky factorization becomes impractical for typical crossed designs because of fill-ins, and proposes a conjugate gradient (CG) sampler that solves ${\boldsymbol{Q}}{\boldsymbol{\theta}}={\boldsymbol{b}}$ to draw approximate Gaussian samples, with a total cost that scales linearly in $N$ and $p$ under realistic spectra. The paper develops a detailed spectral analysis, including Jacobi preconditioning and connections to the adjacency structure of a multi-partite graph, to prove that CG converges in a dimension-free number of iterations for many designs, while identifying graph-structural conditions under which convergence deteriorates. Numerical experiments on real and large-scale simulated data (including voter turnout, instructor evaluations, and MovieLens-25M) confirm the theory: CG often matches or outperforms Cholesky in cost and scales to millions of observations, though nested designs can slow convergence. The results provide principled guidelines on when CG-based sampling is advantageous and offer a framework tying linear-algebraic performance to the underlying graph structure of crossed random effects.

Abstract

Generalized linear mixed models (GLMMs) are a widely used tool in statistical analysis. The main bottleneck of many computational approaches lies in the inversion of the high dimensional precision matrices associated with the random effects. Such matrices are typically sparse; however, the sparsity pattern resembles a multi partite random graph, which does not lend itself well to default sparse linear algebra techniques. Notably, we show that, for typical GLMMs, the Cholesky factor is dense even when the original precision is sparse. We thus turn to approximate iterative techniques, in particular to the conjugate gradient (CG) method. We combine a detailed analysis of the spectrum of said precision matrices with results from random graph theory to show that CG-based methods applied to high-dimensional GLMMs typically achieve a fixed approximation error with a total cost that scales linearly with the number of parameters and observations. Numerical illustrations with both real and simulated data confirm the theoretical findings, while at the same time illustrating situations, such as nested structures, where CG-based methods struggle.

Conjugate gradient methods for high-dimensional GLMMs

TL;DR

This work addresses the computational bottleneck in fitting high-dimensional GLMMs, where posterior precision combines a diagonal prior with a dense likelihood contribution . It shows that exact Cholesky factorization becomes impractical for typical crossed designs because of fill-ins, and proposes a conjugate gradient (CG) sampler that solves to draw approximate Gaussian samples, with a total cost that scales linearly in and under realistic spectra. The paper develops a detailed spectral analysis, including Jacobi preconditioning and connections to the adjacency structure of a multi-partite graph, to prove that CG converges in a dimension-free number of iterations for many designs, while identifying graph-structural conditions under which convergence deteriorates. Numerical experiments on real and large-scale simulated data (including voter turnout, instructor evaluations, and MovieLens-25M) confirm the theory: CG often matches or outperforms Cholesky in cost and scales to millions of observations, though nested designs can slow convergence. The results provide principled guidelines on when CG-based sampling is advantageous and offer a framework tying linear-algebraic performance to the underlying graph structure of crossed random effects.

Abstract

Generalized linear mixed models (GLMMs) are a widely used tool in statistical analysis. The main bottleneck of many computational approaches lies in the inversion of the high dimensional precision matrices associated with the random effects. Such matrices are typically sparse; however, the sparsity pattern resembles a multi partite random graph, which does not lend itself well to default sparse linear algebra techniques. Notably, we show that, for typical GLMMs, the Cholesky factor is dense even when the original precision is sparse. We thus turn to approximate iterative techniques, in particular to the conjugate gradient (CG) method. We combine a detailed analysis of the spectrum of said precision matrices with results from random graph theory to show that CG-based methods applied to high-dimensional GLMMs typically achieve a fixed approximation error with a total cost that scales linearly with the number of parameters and observations. Numerical illustrations with both real and simulated data confirm the theoretical findings, while at the same time illustrating situations, such as nested structures, where CG-based methods struggle.

Paper Structure

This paper contains 40 sections, 13 theorems, 66 equations, 5 figures, 8 tables, 5 algorithms.

Key Result

Theorem 1

Denoting with $\mathop{\mathrm{Cost}}\nolimits(\mathrm{Chol})$ the number of floating point operations (flops) needed to compute the Cholesky factor ${\boldsymbol{L}}$ of a positive definite matrix ${\boldsymbol{Q}}$, we have

Figures (5)

  • Figure 1: $\mathop{\mathrm{Cost}}\nolimits(\mathrm{Chol})$ (red dots) as a function of $p$ ($x$-axis) in a log-log scale, for scenarios (a)--(c) described above. The purple triangles represent the cost of the CG sampler, which is discussed in Section \ref{['sec:simulated']}. ${\boldsymbol{Q}}$ is obtained by setting $\tau = 1$ and ${\boldsymbol{T}} = {\boldsymbol{I}} _p$.
  • Figure 2: The two histograms show the spectrum of ${\boldsymbol{Q}}$ (left) and $\bar{{\boldsymbol{Q}}}$ (right). They are obtained by setting ${\boldsymbol{T}}= {\boldsymbol{I}} _p, \ \tau = 1$, $G_1 = 300$, $G_2 = 1000$, $G_3 = 2000$. The dashed lines on the left panel correspond to the values $T_k + \tau N /G_k$, for $k=1,2,3$.
  • Figure 3: CI graph of Example \ref{['ex:worst_sla']} obtained for $d=2$ and $G=7$. The two panels illustrate how to obtain a path from $\theta_{25}$ to $\theta_{23}$ following the procedure explained in the proof of Proposition \ref{['prop:unfavourable']} above. In the notation of the proof, $j=5$ and $r(j) = 2 \leq m =3$. The figure on the left shows the path from $\theta_{25}$ to $\theta_{21}$ that goes through $\theta_{2\, r(5)}$ and $\theta_{2\, r(r(5))}$, while the figure on the right, shows the path from $\theta_{23}$ to $\theta_{21}$.
  • Figure 4: Precision matrix ${\boldsymbol{Q}}$ of $\mathcal{L}({\boldsymbol{\theta}} \mid {\boldsymbol{y}}, {\boldsymbol{g}})$ induced by the structured design of Example \ref{['ex:worst_sla']} with $G=20$, $d=3$ and default ordering $(\theta_{1,1}, \dots, \theta_{1,G}, \theta_{2,1}, \dots, \theta_{2,G}, \theta_0)$.
  • Figure 5: Conditional independence graph of Example \ref{['ex:pairwise_non_connected']}. Obtained for $G_1 = G_2 = G_3 = 3$, and $N= 9$.

Theorems & Definitions (30)

  • Theorem 1
  • Theorem 2
  • Proposition 1
  • Example 1
  • Proposition 2
  • Remark 1: CG preconditioners
  • Theorem 3
  • Corollary 1
  • Lemma 1
  • Remark 2
  • ...and 20 more