Dependable Distributed Training of Compressed Machine Learning Models

Francesco Malandrino; Giuseppe Di Giacomo; Marco Levorato; Carla Fabiana Chiasserini

TL;DR

DepL addresses the need for dependable training in distributed ML by guaranteeing a target learning quality with a specified probability while minimizing cost. It jointly optimizes data selection, model version switching between full and compressed networks, and cluster/resource allocation, employing an outer dataset-minimization loop, a discretized expanded-graph approach for model selection, and a VNF-placement-based method for resource allocation. The authors prove NP-hardness of the base problem, establish a quadratic worst-case complexity bound, and derive a constant competitive ratio, demonstrating that DepL closely matches the optimum and outperforms a state-of-the-art, expectation-focused baseline. Empirical results on AlexNet and MobileNet show DepL yields near-optimal costs and reliable loss guarantees, with performance robust to discretization granularity and model choice. The work advances dependable ML training by integrating probabilistic loss modeling, model compression, and distributed resource orchestration, with potential to pair with conformal-prediction techniques for post-hoc reliability improvements.

Abstract

Existing work on the distributed training of machine learning (ML) models has consistently overlooked the distribution of the achieved learning quality, focusing instead on its average value. This leads to poor dependability of the resulting ML models, whose performance may be much worse than expected. We fill this gap by proposing DepL, a framework for dependable learning orchestration, able to make high-quality, efficient decisions on (i) the data to leverage for learning, (ii) the models to use and when to switch among them, and (iii) the clusters of nodes, and the resources thereof, to exploit. For concreteness, we consider as possible available models a full DNN and its compressed versions. Unlike previous studies, DepL guarantees that a target learning quality is reached with a target probability, while keeping the training cost at a minimum. We prove that DepL has a constant competitive ratio and polynomial complexity, and show that it outperforms the state of the art by over 27% and closely matches the optimum.
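
The difference between an average-loss target and a probabilistic one can be illustrated with a minimal sketch. It is not DepL's method: the function names, the loss samples, and the thresholds below are all hypothetical, and the check is a plain empirical frequency over sampled final losses rather than the fitted loss distributions the paper relies on.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def meets_target(final_losses, loss_target, prob_target):
    """Return True if the empirical fraction of runs whose final loss is at
    most loss_target reaches prob_target. Illustrative check only."""
    if not final_losses:
        raise ValueError("need at least one loss sample")
    hits = sum(1 for loss in final_losses if loss <= loss_target)
    return hits / len(final_losses) >= prob_target


# Hypothetical final losses over ten training runs of two configurations.
steady = [0.48, 0.50, 0.52, 0.49, 0.51, 0.50, 0.50, 0.49, 0.51, 0.50]
erratic = [0.30, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30, 0.30, 1.10, 1.20]

# Hypothetical requirement: loss <= 0.55 with probability >= 0.95.
LOSS_TARGET = 0.55
PROB_TARGET = 0.95

steady_ok = meets_target(steady, LOSS_TARGET, PROB_TARGET)
erratic_ok = meets_target(erratic, LOSS_TARGET, PROB_TARGET)
```

In this made-up data the erratic configuration has the lower mean loss, so a purely expectation-driven choice would prefer it, yet only the steady configuration satisfies the probabilistic requirement.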

Paper Structure (13 sections, 5 equations, 7 figures, 1 table)


Introduction
The Importance of Dependable Training
System Model and Problem Formulation
The DepL Solution
Dataset selection
Model selection
Node and resource allocation
Problem and Solution Analysis
Numerical Results
Estimating the loss evolution
Performance evaluation
Related work
Conclusions

Figures (7)

Figure 1: DepL enables the optimization of dependable, distributed training of neural networks over heterogeneous nodes, datasets, and model architectures. The optimal configuration is designed to control the uncertainty in the loss progression along training epochs.
Figure 2: Example of the evolution of the test loss during training of the AlexNet CNN model pruned with factors 0.5 and 0.75. Lines represent the expected loss, while shaded areas are delimited by the 1st and 99th percentiles of the loss.
Figure 3: The main steps of the DepL solution strategy.
Figure 4: Example distributions of $\mathcal{X}$ as defined in (\ref{eq:model}) along with the Inverse-Gamma fit (left); distribution of the $p$-values highlighting fitting quality (center); example distribution of $l^\text{sw}$ as defined in (\ref{eq:ell-k}) along with the Student's t fit (right).
Figure 5: AlexNet: performance of DepL and alternative benchmarks: cost (left); normalized expected loss (center); effect of $\eta$ (right).
...and 2 more figures
