Uncoded Storage Coded Transmission Elastic Computing with Straggler Tolerance in Heterogeneous Systems
Xi Zhong, Joerg Kliewer, Mingyue Ji
TL;DR
USCTEC tackles elastic computing in heterogeneous systems by combining uncoded storage with coded transmission, using $L$-ary Lagrange codes to tolerate stragglers during matrix-matrix multiplication. It provides optimal USCTEC schemes without storage constraints for a fixed speed realization, and a heuristic algorithm for general speed distributions under storage constraints, built on the subproblems $(l,\boldsymbol{s},\boldsymbol{\sigma})$-LP and $(\boldsymbol{\theta},\rho)$-DP. The framework yields a two-stage design that partitions $\boldsymbol{A}$ into $G$ blocks, distributes storage $\boldsymbol{e}$, and constructs Lagrange-code-based transmissions so that decoding succeeds with any $L$ out of $L+S$ evaluators. Compared with cyclic storage baselines, USCTEC achieves a smaller storage size and comparable or improved expected computation time in heterogeneous, straggler-prone settings, as shown through illustrative scenarios and performance tradeoffs.
Abstract
In 2018, Yang et al. introduced a novel and effective approach, based on maximum distance separable (MDS) codes, to mitigate the impact of elasticity in cloud computing systems. This approach is referred to as coded elastic computing. However, it assumes that all virtual machines have the same computing speed and storage capacity, and it cannot tolerate stragglers in matrix-matrix multiplications. To address these limitations, we introduce a new combinatorial optimization framework, named uncoded storage coded transmission elastic computing (USCTEC), that accounts for heterogeneous speeds and storage constraints and aims to minimize the expected computation time of matrix-matrix multiplications while tolerating stragglers. Within this framework, we propose optimal solutions with straggler tolerance under relaxed storage constraints. Moreover, we propose a heuristic algorithm that takes the heterogeneous storage constraints into account. Our results demonstrate that the proposed algorithm outperforms baseline solutions based on cyclic storage placements in terms of both expected computation time and storage size.
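The straggler tolerance described above rests on an MDS-style property: $L$ result blocks are encoded into $L+S$ coded symbols, and any $L$ of them suffice for decoding. The following is a minimal sketch of that property using Lagrange interpolation over the rationals. It is not the paper's construction: the function names `encode` and `decode`, the integer evaluation points, and the use of exact fractions instead of a finite field are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations


def lagrange_eval(xs, ys, x):
    """Evaluate at x the unique polynomial of degree < len(xs) passing
    through the points (xs[i], ys[i]); each ys[i] is a vector (list)."""
    total = [Fraction(0)] * len(ys[0])
    for i, xi in enumerate(xs):
        # Lagrange basis polynomial ell_i evaluated at x.
        coeff = Fraction(1)
        for j, xj in enumerate(xs):
            if j != i:
                coeff *= Fraction(x - xj, xi - xj)
        total = [t + coeff * y for t, y in zip(total, ys[i])]
    return total


def encode(blocks, num_stragglers):
    """Map L blocks to L+S coded symbols, keyed by evaluation point.
    Hypothetical choice: data points 0..L-1, evaluation points 0..L+S-1."""
    L = len(blocks)
    data_points = list(range(L))
    return {
        beta: lagrange_eval(data_points, blocks, beta)
        for beta in range(L + num_stragglers)
    }


def decode(received, L):
    """Recover the L original blocks from any L received coded symbols."""
    points = sorted(received)[:L]
    values = [received[p] for p in points]
    # Re-interpolate the polynomial at the original data points 0..L-1.
    return [lagrange_eval(points, values, alpha) for alpha in range(L)]


if __name__ == "__main__":
    blocks = [[1, 2], [3, 4], [5, 6]]          # L = 3 result blocks
    coded = encode(blocks, num_stragglers=2)   # L + S = 5 coded symbols
    # Drop any S = 2 symbols (the stragglers) and decode from the rest.
    for subset in combinations(coded, len(blocks)):
        survivors = {p: coded[p] for p in subset}
        assert decode(survivors, len(blocks)) == blocks
    print("decoded correctly from every 3-of-5 subset")
```

Because the encoding polynomial has degree at most $L-1$, any $L$ distinct evaluations determine it uniquely, which is why the loop succeeds for every subset. A practical scheme would typically work over a finite field and apply the code to matrix blocks rather than short lists.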
