Dual-Lagrange Encoding for Storage and Download in Elastic Computing for Resilience
Xi Zhong, Samuel Lu, Joerg Kliewer, Mingyue Ji
TL;DR
The paper tackles efficient, resilient matrix-matrix computations in elastic cloud environments by introducing Dual-Lagrange Encoding for Storage and Download (LCSD). By encoding both input matrices A and B with Lagrange codes and employing two partition-based schemes, it reduces per-machine storage to a fraction $\frac{1}{L}$ of that required by the earlier approach of Zhong et al., at the expense of doubling the download cost, while tolerating both stragglers and elasticity; a storage-sharing framework further enables flexible trade-offs. The authors validate the approach with AWS EC2 experiments, showing notable gains for heterogeneous assignments and quantifying the impact of stragglers on performance. This work provides a practical, scalable framework for resilient elastic computing in distributed matrix operations, with concrete encoding/decoding strategies and deployment guidance.
Abstract
Coded elastic computing enables virtual machines to be preempted for high-priority tasks while allowing new virtual machines to join ongoing computation seamlessly. This paper addresses coded elastic computing for matrix-matrix multiplications with straggler tolerance by encoding both storage and download using Lagrange codes. In 2018, Yang et al. introduced the first coded elastic computing scheme for matrix-matrix multiplications, achieving a lower computational load requirement. However, this scheme lacks straggler tolerance and suffers from high upload cost. Zhong et al. (2023) later tackled these shortcomings by employing uncoded storage and Lagrange-coded download. Their approach, however, requires each machine to store the entire dataset. This paper introduces a new class of elastic computing schemes that utilize Lagrange codes to encode both storage and download, achieving a reduced storage size. The proposed schemes efficiently mitigate both elasticity and straggler effects, with a storage size reduced to a fraction $\frac{1}{L}$ of Zhong et al.'s approach, at the expense of doubling the download cost. Moreover, we evaluate the proposed schemes on AWS EC2 by measuring computation time under two different task allocations: heterogeneous and cyclic assignments. Both assignments minimize computation redundancy of the system while distributing varying computation loads across machines.
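To make the idea of encoding both matrices with Lagrange codes concrete, the following is a minimal sketch of generic Lagrange-coded matrix multiplication. It is not the authors' LCSD schemes: it omits the partition-based task assignment and elasticity handling, and it uses exact rational arithmetic in place of a finite field. The function names, the column/row block partition, and the choice of interpolation points are all assumptions made for illustration. Each worker holds one coded block of A and one of B (a fraction $\frac{1}{L}$ of each), and the product polynomial has degree $2(L-1)$, so any $2L-1$ finished workers suffice to decode.

```python
from fractions import Fraction

def lagrange_weights(nodes, z):
    """Return the Lagrange basis values l_i(z) for the given distinct nodes."""
    weights = []
    for i, x_i in enumerate(nodes):
        w = Fraction(1)
        for j, x_j in enumerate(nodes):
            if j != i:
                w *= Fraction(z - x_j) / Fraction(x_i - x_j)
        weights.append(w)
    return weights

def combine(weights, mats):
    """Weighted sum of equally sized matrices given as lists of rows."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(w * m[r][c] for w, m in zip(weights, mats))
             for c in range(cols)] for r in range(rows)]

def matmul(X, Y):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def coded_matmul(A, B, L, alphas, finished):
    """Recover A @ B from any 2L-1 workers listed in `finished`.

    Assumptions (hypothetical setup): L divides the inner dimension,
    A is split into L column blocks and B into L row blocks, and
    worker n evaluates both encoding polynomials at alphas[n].
    """
    k = len(B) // L
    A_blocks = [[row[i * k:(i + 1) * k] for row in A] for i in range(L)]
    B_blocks = [B[i * k:(i + 1) * k] for i in range(L)]
    betas = list(range(L))      # interpolation nodes carrying the data blocks
    need = 2 * L - 1            # the product polynomial has degree 2(L-1)
    if len(finished) < need:
        raise ValueError("too few finished workers to decode")
    used = sorted(finished)[:need]

    # Worker side: store one coded block of A and one of B, then multiply.
    results = []
    for n in used:
        w = lagrange_weights(betas, alphas[n])
        results.append(matmul(combine(w, A_blocks), combine(w, B_blocks)))

    # Decoder side: interpolate the product polynomial at every beta_i and
    # add the values, since A @ B = sum_i A_i @ B_i.
    pts = [alphas[n] for n in used]
    total_w = [sum(ws) for ws in zip(*(lagrange_weights(pts, b) for b in betas))]
    return combine(total_w, results)
```

For example, with L = 3 and seven workers at evaluation points 3, 4, ..., 9, calling `coded_matmul(A, B, 3, list(range(3, 10)), finished=[0, 2, 3, 5, 6])` decodes the product even though workers 1 and 4 never respond. Any five finished workers would do.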
