Distributed Convolutional Neural Network Training on Mobile and Edge Clusters
Pranav Rama, Madison Threadgill, Andreas Gerstlauer
TL;DR
The paper addresses the challenge of training CNNs on resource-constrained mobile and edge devices without relying on a central edge or cloud server for coordination. It introduces a tile-based scheme for partitioning the forward and backward passes, with fusion of locally computed tiles to preserve locality, and a layer grouping mechanism to balance computation against communication. Empirical results for YOLOv2 on a Raspberry Pi cluster show 2x–15x speedups over a single core and up to 8x lower memory usage per device, with grouping providing up to a further 1.5x speedup depending on the reference profile and batch size. This edge-only training approach offers latency and privacy benefits, and the paper outlines practical paths for future enhancements such as weight partitioning and extension to weight-dominated layers.
Abstract
The training of deep and/or convolutional neural networks (DNNs/CNNs) is traditionally done on servers with powerful CPUs and GPUs. Recent efforts have emerged to localize machine learning tasks fully on the edge. This brings advantages in reduced latency and increased privacy, but necessitates working with resource-constrained devices. Approaches for inference and training on mobile and edge devices based on pruning, quantization, or incremental and transfer learning require trading off accuracy. Several works have instead explored distributing inference operations across mobile and edge clusters. However, there is limited literature on distributed training on the edge, and existing approaches all require a central, potentially powerful edge or cloud server for coordination or offloading. In this paper, we describe an approach for distributed CNN training exclusively on mobile and edge devices. Our approach is beneficial for the initial CNN layers, which are dominated by feature maps. It is based on partitioning forward inference and back-propagation operations among devices through tiling and fusing to maximize locality and expose communication- and memory-aware parallelism. We also introduce the concept of layer grouping to further fine-tune performance based on the trade-off between computation and communication. Results show that on a cluster of 2-6 quad-core Raspberry Pi 3 devices, training an object-detection CNN achieves a 2x-15x speedup with respect to a single core and up to an 8x reduction in memory usage per device, all without sacrificing accuracy. Grouping offers up to a 1.5x speedup depending on the reference profile and batch size.
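To make the idea of tiling and fusing more concrete, the sketch below shows how one might compute, along a single spatial dimension, the input region a device needs in order to produce one output tile after a stack of fused layers without any intermediate communication. It is a minimal illustration under stated assumptions, not the authors' implementation: the `Layer` type, the function names, and the three-layer example stack (3x3 convolution, 2x2 pooling, 3x3 convolution) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    """Hypothetical 1-D description of a conv or pooling layer."""
    kernel: int  # kernel size along this dimension
    stride: int
    pad: int     # symmetric zero padding


def out_size(in_size, layer):
    """Output length of a layer for a given input length."""
    return (in_size + 2 * layer.pad - layer.kernel) // layer.stride + 1


def input_range(out_lo, out_hi, layer, in_size):
    """Inclusive input index range read by outputs out_lo..out_hi,
    clipped to the feature map (padding lies outside the map)."""
    lo = out_lo * layer.stride - layer.pad
    hi = out_hi * layer.stride - layer.pad + layer.kernel - 1
    return max(lo, 0), min(hi, in_size - 1)


def fused_tile_ranges(layers, in_size, out_lo, out_hi):
    """Walk backward through a fused stack of layers.

    Returns one inclusive range per feature map, ordered from the input
    of the first layer to the output of the last layer.
    """
    sizes = [in_size]
    for layer in layers:
        sizes.append(out_size(sizes[-1], layer))

    ranges = [(out_lo, out_hi)]
    for layer, size in zip(reversed(layers), reversed(sizes[:-1])):
        lo, hi = ranges[0]
        ranges.insert(0, input_range(lo, hi, layer, size))
    return ranges


# Assumed example stack: conv 3x3 (pad 1), pool 2x2 (stride 2), conv 3x3 (pad 1).
layers = [Layer(3, 1, 1), Layer(2, 2, 0), Layer(3, 1, 1)]

# Split the 8-wide final output into two tiles, one per device.
left = fused_tile_ranges(layers, 16, 0, 3)
right = fused_tile_ranges(layers, 16, 4, 7)
print(left[0], right[0])  # input regions each device must hold
```

In this example the two devices need input columns 0..10 and 5..15 respectively, so they overlap on columns 5..10. The overlap grows with the number of fused layers, which is one way to see the computation versus communication trade-off that layer grouping is meant to tune: fusing more layers avoids exchanging intermediate tile borders but duplicates more work at the tile edges.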
