Machine Learning and CPU (Central Processing Unit) Scheduling Co-Optimization over a Network of Computing Centers

Mohammadreza Doostmohammadian, Zulfiya R. Gabidullina, Hamid R. Rabiee

TL;DR

This work addresses the problem of co-optimizing CPU scheduling and distributed machine learning across a network of computing centers. It proposes a networked, consensus-based gradient-tracking algorithm with both linear (ideal exchange) and nonlinear (log-scale quantized exchange) variants to solve a two-block objective: $\min \sum_i \big(f_i(\mathbf{x}_i) + g_i(\mathbf{y}_i)\big)$ subject to $\mathbf{y}_1=\cdots=\mathbf{y}_n$ and $\sum_i \mathbf{x}_i=b$, while maintaining all-time feasibility. Convergence is established via perturbation theory and Lyapunov analysis, showing that for sufficiently small $\alpha$ (and, with quantization, $\alpha<|\lambda_2|/(L(1+\rho/2))$) the algorithm reaches the optimal point even on time-varying, connected networks. Empirical results on distributed SVM and linear regression, plus an MNIST-based experiment, demonstrate consensus on the ML parameters, sustained feasibility of the resource allocation, and substantial reductions in the cost-optimality gap compared to existing CPU-scheduling methods, with robust performance under log-scale quantization. These findings offer a scalable, fault-tolerant approach to decentralized resource management for distributed ML workloads in data-center networks and edge-cloud ecosystems.

Abstract

In the rapidly evolving research on artificial intelligence (AI), the demand for fast, computationally efficient, and scalable solutions has increased in recent years. This paper considers the problem of optimizing the computing resources for distributed machine learning (ML) and optimization. Given a set of data distributed over a network of computing nodes/servers, the idea is to optimally assign the CPU (central processing unit) usage while simultaneously training each computing node locally via its own share of data. This formulates the problem as a co-optimization setup to (i) optimize the data processing and (ii) optimally allocate the computing resources. The information-sharing network among the nodes might be time-varying, but with balanced weights to ensure consensus-type convergence of the algorithm. The algorithm is all-time feasible, which implies that the computing resource-demand balance constraint holds at all iterations of the proposed solution. Moreover, the solution can handle log-scale quantization over the information-sharing channels, so that nodes exchange log-quantized data. As example applications, distributed support-vector-machine (SVM) and regression are considered as the ML training models. Results from perturbation theory, along with Lyapunov stability and eigen-spectrum analysis, are used to prove convergence towards the optimal case. Compared to existing CPU scheduling solutions, the proposed algorithm improves the cost optimality gap by more than $50\%$.
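To make the all-time feasibility property concrete, the following is a minimal sketch of a generic Laplacian-gradient allocation step, not the paper's Algorithm 1. It assumes scalar allocations, a symmetric 0/1 weight matrix on a fixed ring of four nodes, and hypothetical quadratic costs $f_i(x_i) = a_i x_i^2$; the function name, the values of `a`, `b`, and the step size `alpha` are illustrative choices. Because each node moves only along differences of gradients weighted by symmetric links, the sum $\sum_i x_i$ is unchanged at every iteration.

```python
import numpy as np

def laplacian_gradient_step(x, grad_f, W, alpha):
    """One step x_i <- x_i - alpha * sum_j W_ij * (g_i - g_j), with g = grad_f(x).

    For symmetric W the increments sum to zero, so sum(x) is preserved.
    """
    g = grad_f(x)
    delta = W.sum(axis=1) * g - W @ g  # equals (D - W) g, the graph Laplacian times g
    return x - alpha * delta

# Hypothetical setup (assumption): 4 nodes on a ring, costs f_i(x_i) = a_i * x_i**2.
a = np.array([1.0, 2.0, 3.0, 4.0])
grad_f = lambda x: 2.0 * a * x
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

b = 10.0                      # total resource to allocate
x = np.full(4, b / 4)         # feasible initialization: sum(x) == b
sums = []
for _ in range(2000):
    x = laplacian_gradient_step(x, grad_f, W, alpha=0.02)
    sums.append(x.sum())      # track the balance constraint at every iteration
```

In this sketch the iterates settle where all local gradients agree, which is the optimality condition for the sum-constrained problem, while `sums` stays at `b` throughout.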

Paper Structure

This paper contains 17 sections, 7 theorems, 62 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Lemma 1

[SensNets:Olfati04, olfatisaberfaxmurray07] For a network satisfying Assumption \ref{ass_net}, all the eigenvalues of $\overline{W}_\gamma$ are real-valued and negative, except one isolated zero eigenvalue with left (and right) eigenvector $\mathbf{1}_n^\top$ (and $\mathbf{1}_n$), i.e., $\mathbf{1}_n^\top \overline{W}_\gamma = \mathbf{0}_n^\top$ and $\overline{W}_\gamma \mathbf{1}_n = \mathbf{0}_n$ [...]

Figures (7)

  • Figure 1: Comparison between uniform quantization (left) and log-scale quantization (right): uniform quantization is not a sector-bound nonlinearity, while logarithmic quantization is sector-bounded, assigning more bits to represent smaller values and fewer bits to larger values.
  • Figure 2: The data points in 2D used for the SVM classification and the associated SVM classifier line.
  • Figure 3: Time-evolution of the overall cost residual (optimality gap), assigned CPU resources $\mathbf{x}_i$, and SVM parameters ${\omega}_i,\nu_i$ under the proposed Algorithm 1.
  • Figure 4: Time-evolution of the overall cost residual (optimality gap), assigned CPU resources $\mathbf{x}_i$, and regression parameters $\beta_i,\nu_i$ under the proposed Algorithm 1.
  • Figure 5: The top-left panel shows the global objective function $\frac{1}{n}\sum_{i=1}^{n} \frac{1}{m}\sum_{j=1}^{N} g_{i,j}(y_i)$. The other four panels show the local non-convex objective functions at four sample nodes, giving an example of a non-convex local objective function $g_i(\cdot)$ satisfying Assumption \ref{ass_cost}.
  • ...and 2 more figures
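
The contrast drawn in the Figure 1 caption can be checked numerically. The sketch below assumes a common form of the log-scale quantizer, $q(z) = \mathrm{sign}(z)\exp\big(\rho\,[\ln|z|/\rho]\big)$ with $[\cdot]$ denoting rounding to the nearest integer; this form, the function names, and the values of `rho` and `delta` are assumptions for illustration and are not taken from the paper's text. Under this form the ratio $q(z)/z$ always lies in $[e^{-\rho/2}, e^{\rho/2}]$, which for small $\rho$ is close to the $1 \pm \rho/2$ sector, whereas a uniform quantizer maps small inputs to zero and so has no positive lower sector bound.

```python
import numpy as np

def log_quantize(z, rho):
    """Assumed log-scale quantizer: sign(z) * exp(rho * round(ln|z| / rho)); 0 maps to 0."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    out = np.zeros_like(z)
    nz = z != 0
    out[nz] = np.sign(z[nz]) * np.exp(rho * np.round(np.log(np.abs(z[nz])) / rho))
    return out

def uniform_quantize(z, delta):
    """Uniform quantizer with step delta (round to the nearest multiple of delta)."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    return delta * np.round(z / delta)

rho = 0.25
z = np.array([-3.0, -0.01, 0.004, 0.5, 20.0])

log_ratio = log_quantize(z, rho) / z        # stays inside [exp(-rho/2), exp(rho/2)]
uni_ratio = uniform_quantize(z, 0.1) / z    # collapses to 0 for |z| < delta / 2
```

Here the relative error of the log-scale quantizer is the same for small and large magnitudes, which is the sense in which it spends resolution on small values.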

Theorems & Definitions (18)

  • Remark 2
  • Remark 3
  • Remark 4
  • Lemma 1
  • Lemma 2
  • proof
  • Remark 5
  • Lemma 3
  • proof
  • Lemma 4
  • ...and 8 more