ATA: Adaptive Task Allocation for Efficient Resource Management in Distributed Machine Learning
Artavazd Maranjyan, El Mehdi Saad, Peter Richtárik, Francesco Orabona
TL;DR
The paper tackles efficient resource management in asynchronous distributed machine learning: how to allocate a fixed budget of $B$ tasks across $n$ heterogeneous workers whose compute-time distributions are unknown. It recasts the problem as a non-linear stochastic bandit with partial feedback and introduces Adaptive Task Allocation (ATA), a lower-confidence-bound strategy that minimizes the proxy loss $\ell(\bm{a},\bm{\mu}) = \max_i a_i \mu_i$, where $a_i$ is the number of tasks assigned to worker $i$ and $\mu_i$ is that worker's mean computation time. The authors prove that ATA's total computation time is within a factor $(1+4\eta\ln B)$ of the optimum achievable with full knowledge of the arm distributions, plus an additive term that is logarithmic in the horizon. They also provide an empirical variant (ATA-Empirical) that uses data-dependent concentration bounds for better practical performance. In simulations and a CIFAR-100 CNN experiment, the method reduces resource waste while keeping runtime competitive with greedy and oracle-fixed allocations, which highlights its potential for cost-effective distributed learning. The work advances resource allocation for asynchronous ML by connecting online bandit methods with non-linear, combinatorial task assignment and by providing rigorous performance guarantees.
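To make the proxy loss concrete, the following minimal Python sketch evaluates $\ell(\bm{a},\bm{\mu}) = \max_i a_i \mu_i$ and computes a minimizing allocation in the idealized case where the mean times are known. It is an illustration of the objective only, not the ATA algorithm itself, which has to work without knowing $\bm{\mu}$. The function names `proxy_loss` and `allocate_known_means`, the greedy rule, and the example mean times are all assumptions made for this sketch.

```python
import heapq


def proxy_loss(allocation, mean_times):
    """Proxy loss l(a, mu) = max_i a_i * mu_i."""
    return max(a * mu for a, mu in zip(allocation, mean_times))


def allocate_known_means(budget, mean_times):
    """Split `budget` tasks across workers when mean times are known.

    Assumption for this sketch: each task goes to the worker whose
    load (a_i + 1) * mu_i would be smallest after receiving it.
    """
    allocation = [0] * len(mean_times)
    # Heap entries are (load if this worker gets one more task, worker index).
    heap = [(mu, i) for i, mu in enumerate(mean_times)]
    heapq.heapify(heap)
    for _ in range(budget):
        _, i = heapq.heappop(heap)
        allocation[i] += 1
        heapq.heappush(heap, ((allocation[i] + 1) * mean_times[i], i))
    return allocation


# Hypothetical example: three workers with mean times 1, 2, and 4.
mean_times = [1.0, 2.0, 4.0]
allocation = allocate_known_means(7, mean_times)
print(allocation, proxy_loss(allocation, mean_times))  # [4, 2, 1] 4.0
```

In this example the faster workers receive more tasks, and every worker ends with the same expected load of 4 time units, which is the balancing behavior the proxy loss rewards.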
Abstract
Asynchronous methods are fundamental for parallelizing computations in distributed machine learning. They aim to accelerate training by fully utilizing all available resources. However, their greedy approach can lead to inefficiencies, using more computation than required, especially when computation times vary across devices. If the computation times were known in advance, training could be fast and resource-efficient by assigning more tasks to faster workers. The challenge lies in achieving this optimal allocation without prior knowledge of the computation time distributions. In this paper, we propose ATA (Adaptive Task Allocation), a method that adapts to heterogeneous and random distributions of worker computation times. Through rigorous theoretical analysis, we show that ATA identifies the optimal task allocation and performs comparably to methods with prior knowledge of computation times. Experimental results further demonstrate that ATA is resource-efficient, significantly reducing costs compared to the greedy approach, which can be arbitrarily expensive depending on the number of workers.
