Sparse inverse Cholesky factorization of dense kernel matrices by greedy conditional selection
Stephen Huan, Joseph Guinness, Matthias Katzfuss, Houman Owhadi, Florian Schäfer
TL;DR
This work tackles the computational bottleneck of Gaussian process inference with dense kernel matrices by constructing sparse approximate inverse Cholesky factors through minimization of the Kullback-Leibler (KL) divergence. It introduces greedy conditional selection, which chooses conditioning points that maximize mutual information with the target points, conditional on all previously selected points, and extends the method to multiple targets and aggregated Cholesky factors. Maintaining a partial Cholesky factor reduces the cost of selecting $k$ points out of $N$ from $\mathcal{O}(N k^4)$ to $\mathcal{O}(N k^2)$ for a single target, and gives $\mathcal{O}(N k^2 + N m^2 + m^3)$ for $m$ targets. The resulting sparsity patterns improve accuracy over geometry-based selection such as $k$-nearest neighbors, with applications to Cholesky factorization, GP regression, and preconditioning. Empirical results on GP tasks (SARCOS, OCO-2) and image classification show gains in KL divergence, posterior coverage, and RMSE.
Abstract
Dense kernel matrices resulting from pairwise evaluations of a kernel function arise naturally in machine learning and statistics. Previous work on constructing sparse approximate inverse Cholesky factors of such matrices by minimizing Kullback-Leibler divergence recovers the Vecchia approximation for Gaussian processes. These methods rely only on the geometry of the evaluation points to construct the sparsity pattern. In this work, we instead construct the sparsity pattern with a greedy selection algorithm that maximizes mutual information with the target points, conditional on all points previously selected. For selecting $k$ points out of $N$, the naive time complexity is $\mathcal{O}(N k^4)$, but by maintaining a partial Cholesky factor we reduce this to $\mathcal{O}(N k^2)$. Furthermore, for multiple ($m$) targets we achieve a time complexity of $\mathcal{O}(N k^2 + N m^2 + m^3)$, which is maintained in the setting of aggregated Cholesky factorization, where a selected point need not condition every target. We apply the selection algorithm to image classification and to the recovery of sparse Cholesky factors. By minimizing Kullback-Leibler divergence, we apply the algorithm to Cholesky factorization, Gaussian process regression, and preconditioning for the conjugate gradient method, improving over $k$-nearest neighbors selection.
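To make the single-target case concrete, the following is a minimal NumPy sketch of greedy conditional selection, not the authors' implementation. It assumes a single target, a precomputed dense covariance matrix, and an RBF kernel chosen only for illustration; the names `rbf_kernel` and `greedy_select` are hypothetical. For jointly Gaussian variables, maximizing mutual information with the target given the points already selected is equivalent to maximizing the reduction in the target's conditional variance, which is the score used here. Storing one partial Cholesky column per selected point keeps each step at $\mathcal{O}(N k)$, for $\mathcal{O}(N k^2)$ overall.

```python
import numpy as np


def rbf_kernel(X, Y, length_scale=1.0):
    """Squared-exponential kernel (an illustrative choice, not from the paper)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)


def greedy_select(K, target, k, eps=1e-12):
    """Greedily pick up to k indices that most reduce Var(x_target | selected).

    K is the (n, n) covariance over all points, including the target.
    Returns the selected indices and the final conditional variance of the target.
    """
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    cond_var = np.diag(K).copy()    # Var(x_j | selected), updated in place
    cond_cov = K[:, target].copy()  # Cov(x_j, x_target | selected)
    factors = np.zeros((n, k))      # partial Cholesky columns, one per pick
    available = np.ones(n, dtype=bool)
    available[target] = False
    selected = []
    for i in range(k):
        # Variance reduction of the target if candidate j were added next.
        valid = available & (cond_var > eps)
        score = np.where(valid, cond_cov**2 / np.maximum(cond_var, eps), -np.inf)
        j = int(np.argmax(score))
        if not np.isfinite(score[j]):
            break  # no usable candidates remain
        # New column: Cov(., x_j | selected) / sqrt(Var(x_j | selected)).
        col = K[:, j] - factors[:, :i] @ factors[j, :i]
        col /= np.sqrt(col[j])
        factors[:, i] = col
        # Rank-one downdates of the conditional variances and covariances.
        cond_var -= col**2
        cond_cov -= col * col[target]
        selected.append(j)
        available[j] = False
    return selected, float(cond_var[target])


# Usage: select 5 conditioning points for target index 0 among 200 random points.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
K = rbf_kernel(X, X) + 1e-8 * np.eye(200)  # small jitter for numerical stability
selected, target_var = greedy_select(K, target=0, k=5)
print(selected, target_var)
```

With an empty conditioning set and a stationary kernel, the first pick coincides with the nearest neighbor of the target; later picks can differ from $k$-nearest neighbors because the score discounts candidates that are already well explained by the selected points.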
