Sparsity-Preserving Encodings for Straggler-Optimal Distributed Matrix Computations at the Edge
Anindya Bijoy Das, Aditya Ramamoorthy, David J. Love, Christopher G. Brinton
TL;DR
This paper tackles the challenge of preserving input sparsity in straggler-resilient distributed matrix computations for edge learning. It introduces a rigorous encoding-weight framework and proves a lower bound $\hat{\omega}$ on the number of submatrices that must participate in any encoded computation, then develops sparsity-preserving matrix-vector (MV) and matrix-matrix (MM) multiplication schemes that meet this bound. The proposed methods extend to heterogeneous edge devices, come with per-device computational and numerical stability analyses, and are validated on AWS with highly sparse matrices, showing faster computation, reduced communication, and improved numerical robustness compared to dense-code baselines. Overall, the work combines sparsity-aware coding with straggler resilience to enable faster, more scalable edge learning for both MV and MM tasks.
Abstract
Matrix computations are a fundamental building block of edge computing systems, with a major recent uptick in demand due to their use in AI/ML training and inference procedures. Existing approaches for distributing matrix computations involve allocating coded combinations of submatrices to worker nodes to build resilience to slower nodes, called stragglers. In the edge learning context, however, these approaches compromise the sparsity properties that are often present in the original matrices found at the edge server. In this study, we consider the challenge of augmenting such approaches to preserve input sparsity when distributing the task across edge devices, thereby retaining the associated computational efficiency enhancements. First, we find a lower bound on the weight of coding, i.e., the number of submatrices to be combined to obtain coded submatrices, required to provide resilience to the maximum possible number of straggler devices (for a given number of devices and their storage constraints). Next, we propose distributed matrix computation schemes that meet this lower bound exactly. Numerical experiments conducted on Amazon Web Services (AWS) validate our assertions regarding straggler mitigation and computation speed for sparse matrices.
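To make the sparsity concern concrete, the following minimal sketch measures how the density of a coded submatrix grows with the coding weight, i.e., the number of submatrices combined. It is an illustration only and not the scheme proposed in the paper: the matrix size, the 2% density, the partition into four block-columns, the random positive coefficients, and the helper names (`make_sparse`, `coded_submatrix`, `density`) are all assumptions chosen for the example.

```python
import numpy as np

def make_sparse(rows, cols, density, rng):
    """Random matrix whose nonzero entries are positive, so sums cannot cancel."""
    mask = rng.random((rows, cols)) < density
    return mask * rng.uniform(0.5, 1.5, size=(rows, cols))

def coded_submatrix(blocks, weight, rng):
    """Linear combination of the first `weight` blocks (hypothetical encoding)."""
    coeffs = rng.uniform(1.0, 2.0, size=weight)  # positive coefficients
    return sum(c * b for c, b in zip(coeffs, blocks[:weight]))

def density(M):
    """Fraction of entries of M that are nonzero."""
    return np.count_nonzero(M) / M.size

rng = np.random.default_rng(0)
A = make_sparse(1000, 400, 0.02, rng)   # assumed sparse input, about 2% nonzeros
blocks = np.split(A, 4, axis=1)         # four block-columns of size 1000 x 100

# Density of a coded submatrix for coding weights 1 through 4.
densities = [density(coded_submatrix(blocks, w, rng)) for w in range(1, 5)]
for w, d in enumerate(densities, start=1):
    print(f"weight {w}: density {d:.4f}")
```

Because the support of a coded submatrix is the union of the supports of the combined blocks, its density in this sketch never decreases as the weight grows. A worker holding a high-weight coded submatrix therefore performs more multiplications per assigned block than one holding a low-weight combination, which is why a small coding weight matters for sparse inputs.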
