Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge

Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton

TL;DR

This paper tackles the challenge of preserving input sparsity in straggler-resilient distributed matrix computations for edge learning. It introduces a rigorous encoding-weight framework and proves a lower bound $\hat{\omega}$ on the number of submatrices that must participate in any encoded computation, then develops sparsity-preserving matrix-vector (MV) and matrix-matrix (MM) multiplication schemes that meet this bound. The proposed methods extend to heterogeneous edge devices, provide per-device computational and stability analyses, and are validated on AWS with highly sparse matrices, showing faster computation, reduced communication, and improved numerical robustness compared to dense-code baselines. Overall, the work enables faster, more scalable edge learning by combining sparsity-aware coding with straggler resilience, applicable to both MV and MM tasks.

Abstract

Matrix computations are a fundamental building-block of edge computing systems, with a major recent uptick in demand due to their use in AI/ML training and inference procedures. Existing approaches for distributing matrix computations involve allocating coded combinations of submatrices to worker nodes, to build resilience to slower nodes, called stragglers. In the edge learning context, however, these approaches will compromise sparsity properties that are often present in the original matrices found at the edge server. In this study, we consider the challenge of augmenting such approaches to preserve input sparsity when distributing the task across edge devices, thereby retaining the associated computational efficiency enhancements. First, we find a lower bound on the weight of coding, i.e., the number of submatrices to be combined to obtain coded submatrices, to provide the resilience to the maximum possible number of straggler devices (for given number of devices and their storage constraints). Next we propose distributed matrix computation schemes which meet the exact lower bound on the weight of the coding. Numerical experiments conducted in Amazon Web Services (AWS) validate our assertions regarding straggler mitigation and computation speed for sparse matrices.
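To make the sparsity concern concrete, the following is a minimal sketch, not the paper's implementation, of how the encoding weight affects the number of nonzeros in a coded submatrix. The helper names (`random_sparse_block`, `encode`), the block sizes, the 2% density, and the coefficient distribution are all assumptions chosen for illustration; the "dense" code here simply combines all $k_A$ submatrices.

```python
import random

def random_sparse_block(rows, cols, density, rng):
    """Hypothetical helper: a rows x cols block stored as {(i, j): value}."""
    return {(i, j): rng.uniform(-1.0, 1.0)
            for i in range(rows) for j in range(cols)
            if rng.random() < density}

def encode(blocks, rng):
    """Random linear combination of the given blocks.

    Coefficients are drawn from a continuous distribution; the range
    [1, 2] is an arbitrary choice for this sketch.
    """
    coded = {}
    for block in blocks:
        coeff = rng.uniform(1.0, 2.0)
        for pos, val in block.items():
            coded[pos] = coded.get(pos, 0.0) + coeff * val
    return coded

rng = random.Random(0)
k_A = 4  # number of uncoded submatrices (assumed for this sketch)
blocks = [random_sparse_block(200, 50, 0.02, rng) for _ in range(k_A)]

low_weight_code = encode(blocks[:2], rng)  # weight 2: combines two submatrices
dense_code = encode(blocks, rng)           # weight k_A: combines all submatrices

print("nonzeros per uncoded block:", [len(b) for b in blocks])
print("nonzeros, weight-2 code:   ", len(low_weight_code))
print("nonzeros, weight-k_A code: ", len(dense_code))
```

The stored entries of a coded block are the union of the supports of the blocks it combines, so a lower encoding weight keeps the coded submatrix closer to the sparsity of the original data.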

Paper Structure

This paper contains 20 sections, 7 theorems, 20 equations, 6 figures, 3 tables, 2 algorithms.

Key Result

Proposition 1

Consider a distributed computation system of $n$ total devices each of which can store $1/k_A$ fraction of matrix $\mathbf{A}$ (and $1/k_B$ fraction of matrix $\mathbf{B}$ for the matrix-matrix multiplication case). Now, assume that a coded matrix computation scheme aims at resilience to $s = n - k$

Figures (6)

  • Figure 1: Submatrix allocation by the edge server to a system with $n = 6$ devices, $s = 2$ stragglers and $\gamma_A = \frac{1}{4}$ according to Alg. \ref{Alg:New_matvec}. Here, the weight of every coded submatrix is $\omega_A = \Bigl\lceil\frac{k_A(s+1)}{k_A + s}\Bigr\rceil = 2$. Any $\{\mathbf{A}_i, \mathbf{A}_j\}$ indicates a random linear combination of $\mathbf{A}_i$ and $\mathbf{A}_j$.
  • Figure 2: Submatrix allocation for $n = 12$ workers and $s = 3$ stragglers, with $\gamma_A = \frac{1}{9}$ according to Alg. \ref{Alg:New_matvec}. Here, the weight of every coded submatrix is $\omega_A = \Bigl\lceil\frac{k_A(s+1)}{k_A + s}\Bigr\rceil = 3$. Any $\{\mathbf{A}_i, \mathbf{A}_j, \mathbf{A}_k\}$ indicates a random linear combination of the corresponding submatrices where the coefficients are chosen i.i.d. at random from a continuous distribution.
  • Figure 3: A heterogeneous system where $\bar{n} = 8$ and $\bar{k}_A = 5$, and thus $n = 12$ and $k_A = 9$. Here, $W_0$ is assigned three times, and each of $W_1$ and $W_2$ twice, the load of each of $W_3, W_4, \dots, W_7$. This system is resilient to any $s = 3$ block-column processing, i.e., it is resilient to any three type $0$ nodes (e.g., $W_3$ and $W_6$) or any one type $2$ node (e.g., $W_0$).
  • Figure 4: Submatrix allocation according to Alg. \ref{Alg:New_matmat} when $n = 20$ with $\gamma_A = \gamma_B = \frac{1}{4}$; thus resilient to $s = n - \frac{1}{\gamma_A \gamma_B} = 4$ straggler devices. The weights of the submatrices are $\omega_A = \omega_B = 2$. Any assignment $\{\mathbf{A}_i, \mathbf{A}_j\}$ or $\{\mathbf{B}_i, \mathbf{B}_j\}$ indicates a random linear combination of the corresponding submatrices where the coefficients are chosen i.i.d. at random from a continuous distribution.
  • Figure 5: Comparison of encoding weights in matrix-vector (MV) and matrix-matrix (MM) multiplication between the method in \cite{das2023distributed}, our proposed approach, and the theoretical lower bound for different choices of $n$ and $s$.
  • ...and 1 more figure
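
The weight expression quoted in the captions of Figures 1 and 2, $\omega_A = \bigl\lceil k_A(s+1)/(k_A+s) \bigr\rceil$, can be checked numerically. The sketch below covers only the matrix-vector case and assumes $k_A = 1/\gamma_A$, as the captions suggest; the function name `mv_encoding_weight` is a hypothetical helper, not something defined in the paper.

```python
def mv_encoding_weight(k_A: int, s: int) -> int:
    """Evaluate ceil(k_A * (s + 1) / (k_A + s)) with integer arithmetic."""
    numerator = k_A * (s + 1)
    denominator = k_A + s
    # -(-a // b) is the integer ceiling of a / b for positive b.
    return -(-numerator // denominator)

# Figure 1: n = 6, s = 2, gamma_A = 1/4, so k_A = 4 and 12 / 6 = 2.
print(mv_encoding_weight(4, 2))  # 2

# Figure 2: n = 12, s = 3, gamma_A = 1/9, so k_A = 9 and 36 / 12 = 3.
print(mv_encoding_weight(9, 3))  # 3
```

Both values match the weights stated in the captions.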

Theorems & Definitions (35)

  • Definition 1
  • Proposition 1
  • proof
  • Corollary 1
  • proof
  • Example 1
  • Lemma 1
  • proof
  • Claim 1
  • Example 2
  • ...and 25 more