Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge

Anindya Bijoy Das; Aditya Ramamoorthy; David J. Love; Christopher G. Brinton

TL;DR

This paper tackles the challenge of preserving input sparsity in straggler-resilient distributed matrix computations for edge learning. It introduces a rigorous encoding-weight framework and proves a lower bound $\hat{\omega}$ on the number of submatrices that must participate in any encoded computation, then develops sparsity-preserving MV and MM schemes that meet this bound. The proposed methods extend to heterogeneous edge devices, provide per-device computational and stability analyses, and are validated on AWS with highly sparse matrices, showing faster computation, reduced communication, and improved numerical robustness compared to dense-code baselines. Overall, the work enables faster, more scalable edge learning by combining sparsity-aware coding with straggler resilience, applicable to both MV and MM tasks.

Abstract

Matrix computations are a fundamental building block of edge computing systems, with a major recent uptick in demand due to their use in AI/ML training and inference procedures. Existing approaches for distributing matrix computations involve allocating coded combinations of submatrices to worker nodes to build resilience to slower nodes, called stragglers. In the edge learning context, however, these approaches compromise the sparsity properties that are often present in the original matrices found at the edge server. In this study, we consider the challenge of augmenting such approaches to preserve input sparsity when distributing the task across edge devices, thereby retaining the associated computational efficiency enhancements. First, we find a lower bound on the weight of coding, i.e., the number of submatrices to be combined to obtain coded submatrices, required to provide resilience to the maximum possible number of straggler devices (for a given number of devices and their storage constraints). Next, we propose distributed matrix computation schemes that meet this lower bound exactly. Numerical experiments conducted on Amazon Web Services (AWS) validate our assertions regarding straggler mitigation and computation speed for sparse matrices.
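The trade-off described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's scheme: it uses a generic dense random code as a stand-in for the existing approaches, and every size, density, and name (`G`, `decode`, `nnz_fraction`) is an assumption made for illustration. It shows (i) that $\mathbf{A}^T\mathbf{x}$ can be recovered from any $k_A$ of $n$ workers when each worker holds a random combination of all $k_A$ block-columns, and (ii) that such a combination is noticeably denser than a combination of only two block-columns.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for illustration only.
t, r = 400, 240      # A is t x r; the job is to compute A^T x
k_A, n = 4, 6        # k_A block-columns, n workers, so s = n - k_A = 2
density = 0.02       # expected fraction of nonzero entries in A

# Sparse A: Gaussian values kept only where a Bernoulli(density) mask is set.
A = rng.standard_normal((t, r)) * (rng.random((t, r)) < density)
x = rng.standard_normal(t)
blocks = np.split(A, k_A, axis=1)   # A_0, ..., A_{k_A - 1}

# Dense random code (baseline): worker i stores sum_j G[i, j] * A_j.
G = rng.standard_normal((n, k_A))
coded = [sum(G[i, j] * blocks[j] for j in range(k_A)) for i in range(n)]
results = np.stack([C.T @ x for C in coded])   # one row per worker

def decode(done):
    """Recover A^T x from the results of any k_A finished workers."""
    done = list(done)
    # Solve G[done] @ U = results[done]; row j of U is A_j^T x.
    unknowns = np.linalg.solve(G[done], results[done])
    return unknowns.reshape(-1)

truth = A.T @ x
# Worst-case decoding error over every possible set of k_A finished workers.
max_err = max(
    np.abs(decode(S) - truth).max()
    for S in itertools.combinations(range(n), k_A)
)

def nnz_fraction(M):
    """Fraction of entries of M that are nonzero."""
    return np.count_nonzero(M) / M.size

# Combining all k_A blocks destroys much more sparsity than combining two.
dense_coded_density = nnz_fraction(coded[0])
weight2 = rng.standard_normal() * blocks[0] + rng.standard_normal() * blocks[1]
weight2_density = nnz_fraction(weight2)

print(f"max decoding error: {max_err:.2e}")
print(f"density, weight-{k_A} combination: {dense_coded_density:.4f}")
print(f"density, weight-2 combination: {weight2_density:.4f}")
```

The weight-2 combination here only illustrates the density effect; whether a low-weight assignment remains decodable from every set of $k_A$ workers depends on how the submatrices are assigned, which is what the paper's schemes address.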

Paper Structure (20 sections, 7 theorems, 20 equations, 6 figures, 3 tables, 2 algorithms)

Introduction
Problem Formulation, Background and Summary of Contributions
Problem Formulation
Existing Methods and Our Motivations
Summary of Contributions
Minimum Weight of Encoding
Proposed Matrix-vector Multiplication Approach
Straggler Resilience Guarantee
Extension to Heterogeneous System
Computational Complexity for an Edge Device
Numerical Stability and Coefficient Determination Time
Proposed Matrix-matrix Multiplication Approach
Structure of the Job Assignment
Rearrangement of ${\mathcal{M}}_i$'s
Computational Complexity for a Worker Node
...and 5 more sections

Key Result

Proposition 1

Consider a distributed computation system of $n$ total devices each of which can store $1/k_A$ fraction of matrix $\mathbf{A}$ (and $1/k_B$ fraction of matrix $\mathbf{B}$ for the matrix-matrix multiplication case). Now, assume that a coded matrix computation scheme aims at resilience to $s = n - k$

Figures (6)

Figure 1: Submatrix allocation by the edge server to a system with $n = 6$ devices, $s = 2$ stragglers and $\gamma_A = \frac{1}{4}$ according to Alg. \ref{Alg:New_matvec}. Here, the weight of every coded submatrix is $\omega_A = \Bigl\lceil\frac{k_A(s+1)}{k_A + s}\Bigr\rceil = 2$. Any $\{\mathbf{A}_i, \mathbf{A}_j\}$ indicates a random linear combination of $\mathbf{A}_i$ and $\mathbf{A}_j$.
Figure 2: Submatrix allocation for $n = 12$ workers and $s = 3$ stragglers, with $\gamma_A = \frac{1}{9}$ according to Alg. \ref{Alg:New_matvec}. Here, the weight of every submatrix is $\omega_A = \Bigl\lceil\frac{k_A(s+1)}{k_A + s}\Bigr\rceil = 3$. Any $\{\mathbf{A}_i, \mathbf{A}_j, \mathbf{A}_k\}$ indicates a random linear combination of the corresponding submatrices where the coefficients are chosen i.i.d. at random from a continuous distribution.
Figure 3: A heterogeneous system where $\bar{n} = 8$ and $\bar{k}_A = 5$, and thus $n = 12$ and $k_A = 9$. Here, $W_0$ is assigned three times, and each of $W_1$ and $W_2$ twice, the load of each of $W_3, W_4, \dots, W_7$. This system is resilient to any $s = 3$ block-column processing, i.e., it is resilient to any three type $0$ nodes (e.g., $W_3$ and $W_6$) or any one type $2$ node (e.g., $W_0$).
Figure 4: Submatrix allocation according to Alg. \ref{Alg:New_matmat} when $n = 20$ with $\gamma_A = \gamma_B = \frac{1}{4}$; the system is thus resilient to $s = n - \frac{1}{\gamma_A \gamma_B} = 4$ straggler devices. The weights of the submatrices are $\omega_A = \omega_B = 2$. Any assignment $\{\mathbf{A}_i, \mathbf{A}_j\}$ or $\{\mathbf{B}_i, \mathbf{B}_j\}$ indicates a random linear combination of the corresponding submatrices where the coefficients are chosen i.i.d. at random from a continuous distribution.
Figure 5: Comparison of encoding weights in matrix-vector (MV) and matrix-matrix (MM) multiplication between the method in das2023distributed, our proposed approach, and the theoretical lower bound for different choices of $n$ and $s$.
...and 1 more figure
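
The weight expression quoted in the captions of Figures 1 and 2, $\omega_A = \bigl\lceil k_A(s+1)/(k_A+s) \bigr\rceil$, can be checked numerically. The sketch below is only a calculator for that expression; the function name `encoding_weight` is hypothetical, and $k_A = 1/\gamma_A$ is taken from the captions (in both figures $k_A + s = n$).

```python
def encoding_weight(k_A: int, s: int) -> int:
    """Evaluate ceil(k_A * (s + 1) / (k_A + s)) using integer arithmetic only."""
    numerator = k_A * (s + 1)
    denominator = k_A + s
    # -(-a // b) is the ceiling of a / b for positive integers a and b.
    return -(-numerator // denominator)

# Parameters read off the captions: (n, s, gamma_A) = (6, 2, 1/4) and (12, 3, 1/9).
fig1_weight = encoding_weight(k_A=4, s=2)   # ceil(12 / 6)
fig2_weight = encoding_weight(k_A=9, s=3)   # ceil(36 / 12)

print(fig1_weight, fig2_weight)  # prints "2 3", matching the captions
```

Both values are far below $k_A$ (4 and 9, respectively), which is the weight a dense code combining every block-column would have.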

Theorems & Definitions (35)

Definition 1
Proposition 1
proof
Corollary 1
proof
Example 1
Lemma 1
proof
Claim 1
Example 2
...and 25 more
