A Reexamination of the COnfLUX 2.5D LU Factorization Algorithm
Yuan Tang
TL;DR
The paper reexamines the COnfLUX variant of the 2.5D LU factorization with tournament pivoting, addressing questions about its communication bandwidth bound, empirical validation, and lower-bound derivation. It argues that the original upper-bound analysis underestimates costs due to a 1D decomposition that limits active processor participation, and it provides a corrected bandwidth lower bound of $\Omega(n^2/p^{1/2})$ (or $\Omega(n^2/p^{1/3})$ under certain configurations). It also shows that the empirical study tested only a limited set of processor grids and did not evaluate the claimed optimal configuration, and it criticizes the lower-bound derivation for neglecting $p$-dependent I/O growth and partial participation. Overall, the work aims to enhance understanding and spur more rigorous analyses in parallel matrix factorization, with implications for designing and evaluating bandwidth-optimal LU algorithms.
Abstract
This article reexamines the work of Kwasniewski et al. on \func{COnfLUX}, their adaptation of the 2.5D LU factorization algorithm with tournament pivoting. Although the original study provides a theoretical foundation and an instantiation of the proposed algorithm, our reexamination reveals potential concerns regarding its upper bound, its empirical investigation methods, and its lower bound. We discuss each of these matters in turn and highlight probable shortcomings in the original investigation. Our observations are intended to enhance the development and comprehension of parallel matrix factorization algorithms.
