Some new techniques to use in serial sparse Cholesky factorization algorithms
M. Ozan Karsavuran, Esmond G. Ng, Barry W. Peyton, Jonathan L. Peyton
TL;DR
This work addresses efficient serial sparse Cholesky factorization for large sparse symmetric positive definite (SPD) systems by comparing the multifrontal (MF), left-looking (LL), and right-looking (RL) methods and introducing a fourth variant, RLB. The approach exploits supernode structure and reorders columns within supernodes (PR) to create fewer, larger dense blocks, so that most of the work is done in BLAS kernels with minimal floating-point working storage. RL is simpler and slightly faster than MF, while RLB, especially when preceded by PR reordering, consistently outperforms all the others and uses the least floating-point storage, achieving large speedups with multithreaded BLAS. The study shows that parallel performance for sparse Cholesky can be obtained on multi-core CPUs through BLAS-based updates that avoid assembly overhead, though MF retains its strength in out-of-core contexts.
Abstract
We present a new variant of serial right-looking supernodal sparse Cholesky factorization (RL). Our comparison of RL with the multifrontal method confirms that RL is simpler, slightly faster, and requires slightly less storage. The rest of the paper builds on recent work on reordering columns within supernodes so that the dense off-diagonal blocks in the factor matrix joining pairs of supernodes are fewer and larger. We present a second new variant of serial right-looking supernodal sparse Cholesky factorization (RLB), which is specifically designed to exploit the fewer and larger off-diagonal blocks obtained by reordering within supernodes. A key distinction of RLB is that it uses no floating-point working storage and performs no assembly operations. Our key finding is that RLB is unequivocally faster than its competitors: it is consistently, but modestly, faster whenever Intel's MKL sequential BLAS are used and, more importantly, substantially faster whenever Intel's MKL multithreaded BLAS are used. Finally, RLB using the multithreaded BLAS achieves impressive speedups over RLB using the sequential BLAS.
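
To illustrate why reordering within supernodes can yield "fewer and larger" off-diagonal blocks, the following is a minimal sketch, not the authors' implementation. It assumes that a block is a maximal run of consecutive row indices lying within a single target supernode; the function name `off_diagonal_blocks`, the array `snode_of`, and the relabeling `perm` are all hypothetical and chosen only for this example.

```python
from typing import Dict, List, Tuple


def off_diagonal_blocks(rows: List[int], snode_of: List[int]) -> List[Tuple[int, int]]:
    """Split sorted row indices into maximal runs of consecutive rows that
    stay inside one target supernode. Returns (first_row, length) pairs.

    Assumption: this run-based definition of a block is a simplification
    made for illustration.
    """
    blocks: List[Tuple[int, int]] = []
    for r in rows:
        if blocks:
            first, length = blocks[-1]
            last = first + length - 1
            # Extend the current block only if r is adjacent to its last row
            # and belongs to the same target supernode.
            if r == last + 1 and snode_of[r] == snode_of[last]:
                blocks[-1] = (first, length + 1)
                continue
        blocks.append((r, 1))
    return blocks


# Hypothetical layout: rows 0..3 form supernode 0, rows 4..9 form supernode 1.
snode_of = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

# Sub-diagonal rows of some column block of supernode 0 that hit supernode 1.
rows = [4, 6, 8]
before = off_diagonal_blocks(rows, snode_of)

# Hypothetical relabeling of the columns (and hence rows) inside supernode 1
# that places the rows touched above next to each other.
perm: Dict[int, int] = {4: 4, 6: 5, 8: 6, 5: 7, 7: 8, 9: 9}
after = off_diagonal_blocks(sorted(perm[r] for r in rows), snode_of)

print(len(before), "blocks before reordering:", before)
print(len(after), "block after reordering:", after)
```

In this toy case three one-row blocks collapse into a single three-row block. In a BLAS-based update, each block would typically correspond to one dense kernel call, so fewer and larger blocks mean fewer calls on larger operands.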
