A Reexamination of the COnfLUX 2.5D LU Factorization Algorithm

Yuan Tang

TL;DR

The paper reexamines the COnfLUX variant of the 2.5D LU factorization with tournament pivoting, addressing questions about its communication bandwidth bound, empirical validation, and lower-bound derivation. It argues that the original upper-bound analysis underestimates costs due to a 1D decomposition that limits active processor participation, and it provides a corrected bandwidth lower bound of $\Omega(n^2/p^{1/2})$ (or $\Omega(n^2/p^{1/3})$ under certain configurations). It also shows that the empirical study tested only a limited set of processor grids and did not evaluate the claimed optimal configuration, and it criticizes the lower-bound derivation for neglecting $p$-dependent I/O growth and partial participation. Overall, the work aims to enhance understanding and spur more rigorous analyses in parallel matrix factorization, with implications for designing and evaluating bandwidth-optimal LU algorithms.

Abstract

This article reexamines the work of Kwasniewski et al. on COnfLUX, their adaptation of the 2.5D LU factorization algorithm with tournament pivoting. Although the original study provides a theoretical foundation and an instantiation of the proposed algorithm, our reexamination raises potential concerns about its upper bound, its empirical investigation methods, and its lower bound. We discuss each of these matters and highlight probable shortcomings in the original investigation. Our observations are intended to advance the development and understanding of parallel matrix factorization algorithms.

Paper Structure (5 sections, 4 equations, 7 figures)

This paper contains 5 sections, 4 equations, 7 figures.

Figures (7)

  • Figure 1: Description of the 1D decomposition used for the $A_{10}$ and $A_{01}$ regions of LU, from Sect. 7.2 of the original paper [KwasniewskiKaBe21]
  • Figure 2: Snapshot from original "lu_params.hpp" of code base showing that its processor grid setting is $\sqrt{p} \times \sqrt{p} \times 1$ or $\sqrt{p/2} \times \sqrt{p/2} \times 2$
  • Figure 3: Snapshot from original "conflux_opt.hpp" of code base showing that COnfLUX employs at most $\mathit{pi} \cdot \mathit{pk} = p^{1/2}_1 c = O(\sqrt{p})$ processors in the reduction operations of the $A_{10}$ and $A_{01}$ regions.
  • Figure 4: Lemma 7 in Sect. 5 of original paper
  • Figure 5: Lemma 8 in Sect. 7.4 of original paper reveals the processor grid configuration is $p_1^{1/2} \times p_1^{1/2} \times c$.
  • ...and 2 more figures
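
The processor counts quoted in the captions of Figures 2 and 3 can be checked with a small calculation. The sketch below is illustrative only and is not taken from the COnfLUX code base: the function name `active_fraction` is hypothetical, and it assumes, reading the Figure 3 caption, that `pi` equals $p_1^{1/2}$ and `pk` equals $c$ on a $p_1^{1/2} \times p_1^{1/2} \times c$ grid, so that $p = p_1 c$.

```python
def active_fraction(side: int, c: int) -> tuple[int, int, float]:
    """Return (p, active, active / p) for a side x side x c processor grid.

    Assumption (from the Figure 3 caption): at most pi * pk = side * c
    processors take part in the reductions over the A10 and A01 regions.
    """
    p = side * side * c   # total processors, p = p1 * c with p1 = side**2
    active = side * c     # pi * pk = p1^(1/2) * c
    return p, active, active / p


if __name__ == "__main__":
    # Grids of the two shapes named in the Figure 2 caption (c = 1 or c = 2).
    for side, c in [(4, 1), (4, 2), (16, 1), (16, 2), (64, 2)]:
        p, active, frac = active_fraction(side, c)
        print(f"grid {side}x{side}x{c}: p={p}, active={active}, fraction={frac:.4f}")
```

Under these assumptions the active count is $\sqrt{p}$ for $c = 1$ and $\sqrt{2p}$ for $c = 2$, so the participating fraction is $1/p_1^{1/2}$ and shrinks as the grid grows, which is the partial-participation effect the paper discusses.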