Dual-Lagrange Encoding for Storage and Download in Elastic Computing for Resilience

Xi Zhong, Samuel Lu, Joerg Kliewer, Mingyue Ji

TL;DR

The paper tackles efficient, resilient matrix-matrix computations in elastic cloud environments by introducing Dual-Lagrange Encoding for Storage and Download (LCSD). By encoding both A and B with Lagrange codes and employing two partition-based schemes, it achieves reduced storage use and managed download costs while tolerating stragglers and elasticity; a storage-sharing framework further enables flexible trade-offs. The authors validate the approach with AWS EC2 experiments, showing notable gains for heterogeneous assignments and quantifying the impact of stragglers on performance. This work provides a practical, scalable framework for resilient elastic computing in distributed matrix operations, with concrete encoding/decoding strategies and deployment guidance.

Abstract

Coded elastic computing enables virtual machines to be preempted for high-priority tasks while allowing new virtual machines to join ongoing computation seamlessly. This paper addresses coded elastic computing for matrix-matrix multiplications with straggler tolerance by encoding both storage and download using Lagrange codes. In 2018, Yang et al. introduced the first coded elastic computing scheme for matrix-matrix multiplications, achieving a lower computational load requirement. However, this scheme lacks straggler tolerance and suffers from high upload cost. Zhong et al. (2023) later tackled these shortcomings by employing uncoded storage and Lagrange-coded download. However, their approach requires each machine to store the entire dataset. This paper introduces a new class of elastic computing schemes that use Lagrange codes to encode both storage and download, achieving a reduced storage size. The proposed schemes efficiently mitigate both elasticity and straggler effects, with a storage size reduced to a fraction $\frac{1}{L}$ of that in Zhong et al.'s approach, at the expense of doubling the download cost. Moreover, we evaluate the proposed schemes on AWS EC2 by measuring computation time under two different task allocations: heterogeneous and cyclic assignments. Both assignments minimize the computation redundancy of the system while distributing varying computation loads across machines.
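To make the idea of Lagrange-encoding both operands concrete, the following is a minimal sketch of generic Lagrange-coded matrix multiplication. It is not the paper's Scheme 1 or Scheme 2, and it ignores elasticity and task assignment. All names (`encode`, `decode`, `recover`, `betas`, `alphas`) and the toy sizes are assumptions made for illustration. Under these assumptions, $A$ is split column-wise and $B$ row-wise into $L$ blocks, so each machine stores coded blocks of a fraction $\frac{1}{L}$ of the original size, and because the product of the two degree-$(L-1)$ encoding polynomials has degree $2(L-1)$, any $2L-1$ machine results suffice to recover $AB$.

```python
from fractions import Fraction

def lagrange_basis(points, j, x):
    """Value at x of the j-th Lagrange basis polynomial over the given points."""
    out = Fraction(1)
    for m, p in enumerate(points):
        if m != j:
            out *= Fraction(x - p, points[j] - p)
    return out

def matmul(X, Y):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def combine(coeffs, mats):
    """Entrywise linear combination sum_k coeffs[k] * mats[k]."""
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(c * M[r][s] for c, M in zip(coeffs, mats)) for s in range(cols)]
            for r in range(rows)]

def encode(blocks, betas, alpha):
    """Evaluate at alpha the polynomial that equals blocks[j] at betas[j]."""
    return combine([lagrange_basis(betas, j, alpha) for j in range(len(blocks))], blocks)

def decode(results, alphas, beta):
    """Interpolate the product polynomial from (alphas, results); evaluate at beta."""
    return combine([lagrange_basis(alphas, i, beta) for i in range(len(results))], results)

# Toy instance (hypothetical sizes): L = 2 blocks, N = 4 machines.
L = 2
A = [[1, 2, 3, 4], [5, 6, 7, 8]]          # 2 x 4, split column-wise
B = [[1, 0], [0, 1], [2, 1], [1, 3]]      # 4 x 2, split row-wise
A_blocks = [[row[2 * j:2 * j + 2] for row in A] for j in range(L)]
B_blocks = [B[2 * j:2 * j + 2] for j in range(L)]

betas = [0, 1]          # interpolation points carrying the uncoded blocks
alphas = [2, 3, 4, 5]   # one evaluation point per machine

# Each machine stores one coded block of A and one of B (1/L of each matrix).
coded_A = [encode(A_blocks, betas, a) for a in alphas]
coded_B = [encode(B_blocks, betas, a) for a in alphas]
products = [matmul(X, Y) for X, Y in zip(coded_A, coded_B)]

def recover(survivors):
    """Recover AB from any 2L - 1 machine results (indices into alphas)."""
    pts = [alphas[i] for i in survivors]
    res = [products[i] for i in survivors]
    parts = [decode(res, pts, b) for b in betas]   # A_l B_l for each block l
    return combine([1] * L, parts)                 # AB = sum_l A_l B_l

recovered = recover([0, 2, 3])     # machine 1 is treated as a straggler
print(recovered == matmul(A, B))   # prints True
```

Exact rational arithmetic (`Fraction`) is used here only to keep the toy example free of rounding error; a practical implementation would more likely work over a finite field or with numerically stable evaluation points.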

Paper Structure

This paper contains 19 sections, 6 equations, 3 figures, 1 table, 1 algorithm.

Figures (3)

  • Figure 1: Storage-sharing of Scheme $1$ in Example 1. The red, green, and blue lines represent the cases $S = 0$, $S = 1$, and $S = 2$, respectively.
  • Figure 2: Comparison between Scheme $1$ and yang2018coded, yangCEC with $S = 0$, based on the storage-sharing in Example 1. The blue and orange lines represent the storage-sharing of yang2018coded and yangCEC, respectively. The red line represents the storage-sharing of Scheme $1$. The black line represents both yang2018coded and yangCEC.
  • Figure 3: Experiment results when $N = 20$ and $L = 5$. The red and blue lines represent the heterogeneous assignment and the cyclic assignment, respectively. The solid and dashed lines represent the cases $S = 0$ and $S = 4$, respectively.

Theorems & Definitions (4)

  • Definition 1
  • Remark 1
  • Remark 2
  • Example 1