Some new techniques to use in serial sparse Cholesky factorization algorithms

M. Ozan Karsavuran, Esmond G. Ng, Barry W. Peyton, Jonathan L. Peyton

TL;DR

This work addresses efficient serial sparse Cholesky factorization for large sparse symmetric positive definite (SPD) systems. It compares the multifrontal (MF), left-looking (LL), and right-looking (RL) supernodal methods and introduces a fourth variant, RLB. The approach exploits supernode structure and a column reordering within supernodes (PR) that yields fewer, larger dense off-diagonal blocks, so that most of the work is done in BLAS kernels. RL is simpler and slightly faster than MF and needs slightly less storage. RLB, which is designed to follow the PR reordering, needs no floating-point working storage, performs no assembly operations, and is faster than all of the other methods: modestly so with sequential BLAS and substantially so with multithreaded BLAS. The study shows that a serial factorization code can obtain large speedups on multi-core CPUs simply by linking multithreaded BLAS, though MF retains its strength in out-of-core contexts.
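To make "fewer, larger dense blocks" concrete, the following minimal sketch counts the blocks in one column's off-diagonal row structure before and after a relabeling. It rests on simplifying assumptions that are not taken from the paper: a block is modeled as a maximal run of consecutive row indices (supernode boundaries are ignored), and the helper names `count_blocks` and `relabel` and the sample indices are hypothetical.

```python
def count_blocks(rows):
    """Count maximal runs of consecutive row indices (a simplified 'block')."""
    blocks = 0
    prev = None
    for r in sorted(rows):
        # A new block starts whenever the index does not extend the current run.
        if prev is None or r != prev + 1:
            blocks += 1
        prev = r
    return blocks


def relabel(rows, perm):
    """Apply a relabeling (old index -> new index) to a list of row indices.

    Indices missing from `perm` keep their label.
    """
    return [perm.get(r, r) for r in rows]


# Hypothetical row structure of one supernode below its diagonal block.
rows_before = [4, 6, 9, 10]

# Hypothetical reordering inside a later supernode: swap columns 5 and 6.
swap_5_6 = {5: 6, 6: 5}
rows_after = relabel(rows_before, swap_5_6)

print(count_blocks(rows_before), count_blocks(rows_after))
```

In this toy case the rows 4 and 6 become adjacent after the swap, so two one-row blocks merge into a single two-row block. Each block that disappears is one fewer BLAS call on a small operand.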

Abstract

We present a new variant of serial right-looking supernodal sparse Cholesky factorization (RL). Our comparison of RL with the multifrontal method confirms that RL is simpler, slightly faster, and requires slightly less storage. The rest of the paper builds on recent work on reordering columns within supernodes so that the dense off-diagonal blocks in the factor matrix joining pairs of supernodes are fewer and larger. We present a second new variant of serial right-looking supernodal sparse Cholesky factorization (RLB), which is designed specifically to exploit the fewer and larger off-diagonal blocks in the factor matrix obtained by reordering within supernodes. A key distinction of RLB is that it uses no floating-point working storage and performs no assembly operations. Our key finding is that RLB is unequivocally faster than its competitors. RLB is consistently, but modestly, faster than its competitors when Intel's MKL sequential BLAS are used. More importantly, RLB is substantially faster than its competitors when Intel's MKL multithreaded BLAS are used. Finally, RLB using the multithreaded BLAS achieves impressive speedups over RLB using the sequential BLAS.
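For readers unfamiliar with the term, "right-looking" means that as soon as a column of the factor is computed, its contribution is applied to all columns to its right. The sketch below illustrates only that ordering on a small dense matrix in pure Python. It is not the paper's RL or RLB algorithm: it is unblocked, ignores sparsity and supernodes, and makes no BLAS calls, and the function name `right_looking_cholesky` is a hypothetical choice for this example.

```python
import math


def right_looking_cholesky(a):
    """Return the lower-triangular L with A = L * L^T for a dense SPD matrix.

    `a` is a list of row lists; only its lower triangle is read.
    """
    n = len(a)
    l = [row[:] for row in a]  # work on a copy so the input is not modified
    for k in range(n):
        if l[k][k] <= 0.0:
            raise ValueError("matrix is not positive definite")
        # Finish column k: scale by the square root of the pivot.
        l[k][k] = math.sqrt(l[k][k])
        for i in range(k + 1, n):
            l[i][k] /= l[k][k]
        # Right-looking step: immediately update every later column
        # (lower triangle only) with the outer product of column k.
        for j in range(k + 1, n):
            for i in range(j, n):
                l[i][j] -= l[i][k] * l[j][k]
    # Clear the strict upper triangle so the result is exactly L.
    for i in range(n):
        for j in range(i + 1, n):
            l[i][j] = 0.0
    return l


A = [[4.0, 12.0, -16.0],
     [12.0, 37.0, -43.0],
     [-16.0, -43.0, 98.0]]
L = right_looking_cholesky(A)
```

In a supernodal code the inner double loop is replaced by dense matrix-matrix updates on whole blocks of columns, which is where BLAS kernels, and in particular multithreaded ones, come into play.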

Paper Structure (14 sections, 20 equations, 6 figures, 2 tables, 5 algorithms)

Figures (6)

  • Figure 1: The supernodes of a sparse Cholesky factor $L$. Each symbol '$\ast$' signifies an off-diagonal entry that is nonzero in both $A$ and $L$; each symbol '$+$' signifies an off-diagonal entry that is zero in $A$ but nonzero in $L$---a fill entry in $L$.
  • Figure 2: The supernodes of the sparse Cholesky factor $\widehat{L}$ obtained after a symmetric permutation of supernode $J_3$ in Figure 1. Let $\widehat{A}$ be the new version of $A$ after the symmetric permutation. Each symbol '$\ast$' signifies an off-diagonal entry that is nonzero in both $\widehat{A}$ and $\widehat{L}$; each symbol '$+$' signifies an off-diagonal entry that is zero in $\widehat{A}$ but nonzero in $\widehat{L}$.
  • Figure 3: The sparse Cholesky factor shown in Figure 1, along with its elimination tree.
  • Figure 4: Performance profile for the factorization times for the 21 large matrices whenever the serial BLAS are linked in.
  • Figure 5: Performance profile for the factorization times for the 21 large matrices whenever the multithreaded BLAS are linked in.
  • ...and 1 more figures