Coded Computing for Resilient Distributed Computing: A Learning-Theoretic Framework

Parsa Moradi; Behrooz Tahmasebi; Mohammad Ali Maddah-Ali

TL;DR

This work proposes a new foundation for coded computing that integrates principles from learning theory and adapts naturally to machine learning applications. The proposed framework outperforms the state of the art in both accuracy and rate of convergence.

Abstract

Coded computing has emerged as a promising framework for tackling significant challenges in large-scale distributed computing, including the presence of slow, faulty, or compromised servers. In this approach, each worker node processes a combination of the data rather than the raw data itself, and the final result is then decoded from the collective outputs of the worker nodes. However, there is a significant gap between current coded computing approaches and the broader landscape of general distributed computing, particularly for machine learning workloads. To bridge this gap, we propose a novel foundation for coded computing that integrates the principles of learning theory and yields a framework that adapts seamlessly to machine learning applications. In this framework, the objective is to find the encoder and decoder functions that minimize the loss function, defined as the mean squared error between the estimated and true values. To facilitate the search for the optimal encoding and decoding functions, we show that the loss function can be upper-bounded by the sum of two terms: the generalization error of the decoding function and the training error of the encoding function. Focusing on the second-order Sobolev space, we then derive the optimal encoder and decoder. We show that in the proposed solution, the mean squared error of the estimation decays at the rate of $\mathcal{O}(S^3 N^{-3})$ and $\mathcal{O}(S^{\frac{8}{5}}N^{\frac{-3}{5}})$ in the noiseless and noisy computation settings, respectively, where $N$ is the number of worker nodes, of which at most $S$ are slow servers (stragglers). Finally, we evaluate the proposed scheme on inference tasks for various machine learning models and demonstrate that the proposed framework outperforms the state of the art in terms of accuracy and rate of convergence.
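To get a feel for the two decay rates quoted in the abstract, the short sketch below evaluates the terms $S^3 N^{-3}$ and $S^{8/5} N^{-3/5}$ directly. The hidden constants in the $\mathcal{O}(\cdot)$ bounds are not given here, so the function names (`noiseless_rate`, `noisy_rate`) and the choice to set those constants to 1 are illustrative assumptions; only ratios between values are meaningful.

```python
import math


def noiseless_rate(S: int, N: int) -> float:
    """Rate term S^3 * N^-3 (noiseless setting), constant assumed to be 1."""
    return S**3 / N**3


def noisy_rate(S: int, N: int) -> float:
    """Rate term S^(8/5) * N^(-3/5) (noisy setting), constant assumed to be 1."""
    return S ** (8 / 5) * N ** (-3 / 5)


def improvement_when_doubling_workers(rate, S: int, N: int) -> float:
    """Factor by which the rate term shrinks when N is doubled at fixed S."""
    return rate(S, N) / rate(S, 2 * N)


if __name__ == "__main__":
    S, N = 5, 100
    print("noiseless factor:", improvement_when_doubling_workers(noiseless_rate, S, N))
    print("noisy factor:", improvement_when_doubling_workers(noisy_rate, S, N))
```

With the number of stragglers held fixed, doubling the number of workers shrinks the noiseless term by a factor of $2^3 = 8$, but the noisy term only by $2^{3/5}$, roughly 1.5, which shows how much more slowly the bound improves under noisy computation.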

Paper Structure (31 sections, 20 theorems, 122 equations, 7 figures, 3 tables)

Introduction
Preliminaries and Problem Definition
Notations
Problem Setting
Proposed Framework: LeTCC
Objective
Main Results
Experimental Results
Related Work
Conclusions and Future Work
Acknowledgments
Preliminaries
Sobolev spaces and Sobolev norms
Smoothing Splines
Proof of Theorems
...and 16 more sections

Key Result

Theorem 1

Consider the $\texttt{LeTCC}$ framework with $N$ worker nodes and at most $S$ stragglers. Assume $\{\alpha_k\}^K_{k=1}$ are arbitrary and distinct points in $\Omega=(-1, 1)$ and there exists a constant $J$ such that $\Delta_{\textrm{max}} \leqslant \frac{J}{N}$. If $f(\cdot)$ is $q$-Lipschitz continuous, then [...] where $C_1$ is a constant.
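The setting of the theorem (data points placed at $\alpha_k \in (-1, 1)$, $N$ workers, at most $S$ stragglers) can be illustrated with a minimal encode, compute, decode sketch. This is not the paper's method: piecewise-linear interpolation is swapped in for the smoothing-spline encoder and decoder, the points `alphas` and `betas` are assumed to be equally spaced, the data are scalars, and all function names are hypothetical.

```python
import math
import random


def interp(xs, ys, t):
    """Piecewise-linear interpolation through (xs, ys); xs ascending, clamped outside."""
    if t <= xs[0]:
        return ys[0]
    if t >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if t <= xs[i]:
            w = (t - xs[i - 1]) / (xs[i] - xs[i - 1])
            return (1 - w) * ys[i - 1] + w * ys[i]


def run_pipeline(data, f, N, S, seed=0):
    """Return the mean squared error of the decoded estimates of f(data[k])."""
    K = len(data)
    # Assumption: equally spaced data points alpha_k and worker points beta_n in (-1, 1).
    alphas = [-1 + 2 * (k + 1) / (K + 1) for k in range(K)]
    betas = [-1 + 2 * (n + 1) / (N + 1) for n in range(N)]
    # Encode: worker n receives the encoder function evaluated at beta_n.
    coded = [interp(alphas, data, b) for b in betas]
    # Compute: each worker applies f to its coded input.
    results = [f(c) for c in coded]
    # Stragglers: S randomly chosen workers never return a result.
    stragglers = set(random.Random(seed).sample(range(N), S))
    kept_betas = [betas[n] for n in range(N) if n not in stragglers]
    kept_results = [results[n] for n in range(N) if n not in stragglers]
    # Decode: fit the surviving results and read off the estimates at alpha_k.
    estimates = [interp(kept_betas, kept_results, a) for a in alphas]
    truth = [f(x) for x in data]
    return sum((e - t) ** 2 for e, t in zip(estimates, truth)) / K


if __name__ == "__main__":
    data = [0.0, 0.1, 0.2, 0.3, 0.4]
    print(run_pipeline(data, math.tanh, N=50, S=5))
```

Because the decoder only needs enough surviving evaluations near each $\alpha_k$, the estimates remain close to the true values even when some workers fail to respond; the smoothing splines used in the paper play the same role with stronger guarantees.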

Figures (7)

Figure 1: $\texttt{LeTCC}$ framework.
Figure 2: Performance comparison of $\texttt{LeTCC}$ and BACC with a $95\%$ confidence interval across a diverse range of stragglers for different models in a low-redundancy regime (smaller $\frac{N}{K}$).
Figure 3: Performance comparison of $\texttt{LeTCC}$ and BACC with a $95\%$ confidence interval across a diverse range of stragglers for different models in a high-redundancy regime (larger $\frac{N}{K}$).
Figure 4: Average performance of $\texttt{LeTCC}$ and Lagrange Coded Computing, with a $95\%$ confidence interval. Plots (a) and (d) show the overall performance, while the zoomed-in subplots (b) and (c) highlight the performance for a smaller range of stragglers.
Figure 5: Sensitivity of $\texttt{LeTCC}$ performance with respect to $\log_{10}(\lambda_\textrm{d})$ and $\log_{10}(\lambda_\textrm{e})$. The yellow line represents the performance when the variable smoothing parameter is set to zero.
...and 2 more figures

Theorems & Definitions (31)

Theorem 1: Upper bound for noiseless computation, $\sigma_0 = 0$
Theorem 2: Upper bound for noisy computation
Theorem 3
Proposition 1
Theorem 4: Convergence rate
Definition 1: Sobolev Space
Definition 2
Theorem 5: Theorem 7.34, leoni2024first
Corollary 1
proof
...and 21 more
