Sparse inverse Cholesky factorization of dense kernel matrices by greedy conditional selection

Stephen Huan, Joseph Guinness, Matthias Katzfuss, Houman Owhadi, Florian Schäfer

TL;DR

This work tackles the computational bottleneck of Gaussian process inference with dense kernel matrices by constructing sparse inverse Cholesky factors through KL-minimization. It introduces greedy conditional selection, which chooses conditioning points to maximize mutual information with targets while accounting for previously selected points, and extends this to multiple targets and aggregated Cholesky factors. The approach achieves substantial complexity reductions (from $O(N k^4)$ to $O(N k^2)$ for single targets, and $O(N k^2 + N m^2 + m^3)$ for $m$ targets) and demonstrates improved accuracy over geometry-based sparsity, with applications spanning Cholesky factorization, GP regression, and preconditioning. Empirical results on GP tasks (SARCOS, OCO-2) and image classification show robust gains in KL divergence, posterior coverage, and RMSE, highlighting practical impact for scalable kernel methods and related numerical linear algebra tasks.

Abstract

Dense kernel matrices resulting from pairwise evaluations of a kernel function arise naturally in machine learning and statistics. Previous work in constructing sparse approximate inverse Cholesky factors of such matrices by minimizing Kullback-Leibler divergence recovers the Vecchia approximation for Gaussian processes. These methods rely only on the geometry of the evaluation points to construct the sparsity pattern. In this work, we instead construct the sparsity pattern by leveraging a greedy selection algorithm that maximizes mutual information with target points, conditional on all points previously selected. For selecting $k$ points out of $N$, the naive time complexity is $\mathcal{O}(N k^4)$, but by maintaining a partial Cholesky factor we reduce this to $\mathcal{O}(N k^2)$. Furthermore, for multiple ($m$) targets we achieve a time complexity of $\mathcal{O}(N k^2 + N m^2 + m^3)$, which is maintained in the setting of aggregated Cholesky factorization where a selected point need not condition every target. We apply the selection algorithm to image classification and recovery of sparse Cholesky factors. By minimizing Kullback-Leibler divergence, we apply the algorithm to Cholesky factorization, Gaussian process regression, and preconditioning with the conjugate gradient, improving over $k$-nearest neighbors selection.
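
The single-target selection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `greedy_select`, its arguments, and the exponential (Matérn $\nu = 1/2$) test kernel are assumptions made for illustration, and the kernel matrix is assumed to be positive definite with distinct points. For jointly Gaussian variables, maximizing mutual information with the target is equivalent to minimizing the target's conditional variance, so each step picks the candidate $j$ maximizing $\mathrm{Cov}(y_{\text{target}}, y_j \mid I)^2 / \mathrm{Var}(y_j \mid I)$, where $I$ is the set selected so far. Storing one column of a partial Cholesky factor per selected point keeps the cost of step $i$ at $\mathcal{O}(N i)$, for $\mathcal{O}(N k^2)$ in total.

```python
import numpy as np


def greedy_select(K, target, candidates, k):
    """Greedily pick k candidates that most reduce the target's conditional variance.

    K is a dense covariance matrix; target and candidates index into it.
    (Hypothetical helper for illustration, not the paper's reference code.)
    """
    cand = np.asarray(candidates)
    n = len(cand)
    idx = np.append(cand, target)             # rows: candidates first, target last
    L = np.zeros((n + 1, k))                  # partial Cholesky factor, one column per pick
    cond_var = K[idx, idx].astype(float)      # Var(y_p | selected)
    cond_cov = K[idx, target].astype(float)   # Cov(y_p, y_target | selected)
    available = np.ones(n, dtype=bool)
    selected = []
    for i in range(min(k, n)):
        # Variance reduction at the target from adding each remaining candidate.
        score = np.full(n, -np.inf)
        score[available] = cond_cov[:n][available] ** 2 / cond_var[:n][available]
        j = int(np.argmax(score))
        # New factor column: conditional covariance with point j, scaled by its std.
        col = K[idx, cand[j]] - L[:, :i] @ L[j, :i]
        col = col / np.sqrt(col[j])
        L[:, i] = col
        # Rank-one updates of the conditional variances and target covariances.
        cond_var -= col ** 2
        cond_cov -= col * col[n]
        selected.append(int(cand[j]))
        available[j] = False
    return selected


# Usage on random 2D points with an exponential kernel (assumed setup).
rng = np.random.default_rng(0)
X = rng.uniform(size=(30, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
K = np.exp(-D)
sel = greedy_select(K, target=0, candidates=range(1, 30), k=5)
```

A naive alternative would recompute the target's conditional variance from scratch for every candidate at every step, which is where the $\mathcal{O}(N k^4)$ cost quoted in the abstract comes from; the sketch avoids this by updating `cond_var` and `cond_cov` in place.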

Paper Structure (61 sections, 30 equations, 33 figures, 12 algorithms)

Figures (33)

  • Figure 1: An illustration of the screening effect with the Matérn kernel with length scale $\ell = 1$ and smoothness $\nu = 1/2$. The first panel shows the unconditional correlation with the point at (0, 0). The second panel shows the conditional correlation after conditioning on the four points in orange.
  • Figure 1: The blue points are the candidates, the orange point is the target point at which to predict, and the green points are the selected points. The red line is the conditional mean $\mu$ given the selected points, and the shaded region is the $\pm 2 \sigma$ confidence interval for the conditional variance $\sigma^2$. Each method has a budget of two points; the left panel shows selection by Euclidean distance and the right by conditional variance. Euclidean distance prefers the two points to the right of the target. Picking the slightly farther but more informative point to the left instead gives a more balanced view, reducing the variance at the target and thereby the predictive error.
  • Figure 1: For a column of a Cholesky factor in isolation, the target point is the diagonal entry, candidates are below it, and the selected entries are added to the sparsity pattern. Points violating lower triangularity are not shown. Thus, sparsity selection in Cholesky factorization (left panel) is analogous to training point selection in directed Gaussian process regression (right panel).
  • Figure 1: Accuracy (left) and computational time (right) of Cholesky factorization methods with varying number of points $N$ and fixed density $\rho = 2$. "$\rho$-ball" is the baseline from Schäfer et al. (2021), "$k$-NN" is selection by $k$-nearest neighbors, "select" is conditional selection, and "(agg.)" denotes aggregation.
  • Figure 1: Illustration of the Cholesky factorization of a partially conditioned covariance matrix. Here grey denotes fully unconditional, blue denotes fully conditional, and the mixed color denotes interaction between the two. Surprisingly, such a matrix factors into a "pure" Cholesky factor by "gluing" the prefix of the fully unconditional factor with the suffix of the fully conditional factor.
  • ...and 28 more figures
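
The screening effect described in the first figure caption can be reproduced numerically with a short sketch. The helper names `matern12` and `conditional_correlation`, the positions of the four conditioning points, and the probe point are all assumptions chosen for illustration; only the kernel (Matérn with $\ell = 1$, $\nu = 1/2$, i.e. $\exp(-\lVert x - y \rVert / \ell)$) and the reference point at (0, 0) come from the caption. The conditional covariance is the Schur complement $K_{aa} - K_{ac} K_{cc}^{-1} K_{ca}$.

```python
import numpy as np


def matern12(X, Y, length_scale=1.0):
    """Matérn kernel with smoothness 1/2 (exponential kernel) between point sets."""
    d = np.linalg.norm(X[:, None, :] - Y[None, :, :], axis=-1)
    return np.exp(-d / length_scale)


def conditional_correlation(origin, points, cond):
    """Correlation between origin and each point after conditioning on cond."""
    A = np.vstack([origin[None, :], points])
    K_aa = matern12(A, A)
    K_ac = matern12(A, cond)
    K_cc = matern12(cond, cond)
    # Schur complement: covariance of A given the conditioning points.
    S = K_aa - K_ac @ np.linalg.solve(K_cc, K_ac.T)
    sd = np.sqrt(np.diag(S))
    return S[0, 1:] / (sd[0] * sd[1:])


origin = np.array([0.0, 0.0])
# Assumed layout: four conditioning points surrounding the origin.
cond = np.array([[0.5, 0.0], [-0.5, 0.0], [0.0, 0.5], [0.0, -0.5]])
# Probe point lying beyond one of the conditioning points.
probe = np.array([[1.0, 0.0]])

unconditional = matern12(origin[None, :], probe)[0]
conditional = conditional_correlation(origin, probe, cond)
```

Under these assumptions, comparing `unconditional` with `conditional` shows how conditioning on nearby points weakens the correlation with points farther away, which is the property that motivates sparse inverse Cholesky factors.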

Theorems & Definitions (4)

  • Proof 1: Proof of Equation \ref{eq:obj_chol}
  • Proof 2: Proof of Equation \ref{eq:obj_mult}
  • Proof 3: Proof of Equation \ref{eq:partial_kl}
  • Proof 4: Proof of Equation \ref{eq:greedy_mult}