Accelerated Training on Low-Power Edge Devices

Mohamed Aboelenien Ahmed, Kilian Pfeiffer, Heba Khdr, Osama Abboud, Ramin Khalili, Jörg Henkel

TL;DR

The paper addresses the challenge of on-device training under strict power constraints by proposing a cross-layer approach that jointly tunes the GPU frequency $f$ and batch size $b$ to accelerate training while meeting $P_{\text{max}}$. It combines offline device profiling of $T_s(b,f,M)$ and $P(b,f,M)$ with server-side estimation of batch-size efficiency via a proxy dataset to minimize $TT_{\text{acc}}(b,f,M,D) = T_s(b,f,M) \times N_{s_{acc}}(b,M,D)$ under $P(b,f,M) \le P_{\text{max}}$. The method builds LUTs from profiling, uses a proxy dataset to infer convergence efficiency, and selects $(b,f)$ at runtime, achieving up to $2.4\times$ speedups and notable energy savings on CNNs and transformers on Jetson devices. This approach is practical for edge deployments, reduces training time and energy, and remains robust to proxy-dataset choices, supporting greener and more adaptable on-device learning.
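The selection step described above can be illustrated with a minimal sketch. It assumes the profiled values are stored as plain dictionaries keyed by $(b,f)$, and that $N_{s_{acc}}$ is expressed as a count of the fixed-size sample sets that $T_s$ is measured on. The function name `select_config`, the dictionary layout, and all numbers are hypothetical placeholders chosen for illustration. They are not measurements from the paper and not the authors' implementation.

```python
from typing import Dict, Optional, Tuple

Config = Tuple[int, int]  # (batch size b, GPU frequency f in MHz)


def select_config(
    time_lut: Dict[Config, float],      # T_s(b, f, M): seconds per fixed-size sample set
    power_lut: Dict[Config, float],     # P(b, f, M): peak power in watts
    sets_to_acc: Dict[int, float],      # N_s_acc(b, M, D): sample sets needed to hit the target accuracy
    p_max: float,                       # power limit P_max in watts
) -> Optional[Tuple[Config, float]]:
    """Return the feasible (b, f) with the lowest estimated TT_acc, or None."""
    best: Optional[Tuple[Config, float]] = None
    for (b, f), t_s in time_lut.items():
        if power_lut[(b, f)] > p_max:
            continue  # violates P(b, f, M) <= P_max
        if b not in sets_to_acc:
            continue  # no convergence estimate for this batch size
        tt_acc = t_s * sets_to_acc[b]  # TT_acc = T_s * N_s_acc
        if best is None or tt_acc < best[1]:
            best = ((b, f), tt_acc)
    return best


# Hypothetical lookup tables (illustrative values only, not profiled data).
TIME_LUT = {
    (8, 300): 10.0, (8, 600): 6.0, (8, 900): 5.5,
    (32, 300): 5.0, (32, 600): 3.5, (32, 900): 2.5,
}
POWER_LUT = {
    (8, 300): 3.0, (8, 600): 4.4, (8, 900): 5.5,
    (32, 300): 4.0, (32, 600): 6.5, (32, 900): 6.8,
}
# Assumed: the larger batch needs more sample sets to reach the same accuracy.
SETS_TO_ACC = {8: 20.0, 32: 40.0}

if __name__ == "__main__":
    for limit in (5.0, 7.0):
        print(limit, select_config(TIME_LUT, POWER_LUT, SETS_TO_ACC, limit))
```

With these placeholder values, a 5 W limit favors the small batch at a mid frequency, even though the large batch has the lower $T_s$ among feasible points, while a 7 W limit makes the large batch at the highest frequency the better choice. This mirrors the qualitative point that ranking by $T_s$ alone can differ from ranking by $TT_{\text{acc}}$.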

Abstract

Training on edge devices poses several challenges as these devices are generally resource-constrained, especially in terms of power. State-of-the-art techniques at the device level reduce the GPU frequency to enforce power constraints, leading to a significant increase in training time. To accelerate training, we propose to jointly adjust the system and application parameters (in our case, the GPU frequency and the batch size of the training task) while adhering to the power constraints on devices. We introduce a novel cross-layer methodology that combines predictions of batch size efficiency and device profiling to achieve the desired optimization. Our evaluation on real hardware shows that our method outperforms the current baselines that depend on state-of-the-art techniques, reducing the training time by $2.4\times$ with results very close to optimal. Our measurements also indicate a substantial reduction in the overall energy used for the training process. These gains are achieved without reducing the performance of the trained model.

Paper Structure

This paper contains 24 sections, 4 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Peak power and time for training a set of samples across different batch size and GPU frequency combinations. The gray plane represents the power limit in the left figure, and the black dots in both figures are the feasible combinations that can be utilized under that constraint. The black circle represents the operating point with the maximum feasible frequency for a batch size of 128, which state-of-the-art techniques would select. The green circle represents an operating point at batch size 64 that could be selected when the frequency and the batch size are jointly chosen to accelerate training under the power constraint. Selecting this operating point accelerates training by $31.9\%$ (see Time curves).
  • Figure 2: The training time on a fixed number of samples $T_s$ and the total training time $TT_\text{acc}$ to reach an accuracy threshold of $78\%$ using two batch sizes, 8 and 32, at the maximum feasible GPU frequencies under three power constraints: $P_1=4.5\,\text{W}$, $P_2=5\,\text{W}$, and $P_3=7\,\text{W}$. We observe that for $P_1$ and $P_2$, selecting $b=8$ leads to lower $TT_\text{acc}$, while for $P_3$ selecting $b=32$ is better. This is in contrast with our observation for $T_s$, where selecting $b=32$ is the best option in all cases.
  • Figure 3: Overview of our proposed cross-layer approach that accelerates training under a power constraint through the joint selection of batch size $b$ and GPU frequency $f$.
  • Figure 4: Energy consumption comparison between our approach and baseline methods during training under three power constraint scenarios. The recorded data includes training ResNet18 and MobileNetV2 on the SVHN and CINIC datasets, as well as a transformer network on the Austen and Dickens dataset, all performed on an Nvidia Jetson Nano.
  • Figure 5: Confusion matrix of the percentage increase in training time relative to the fastest configuration for image classification datasets on Nvidia Jetson Nano across three power constraints for MobileNetV2 training. The rows represent the selected proxy dataset, while the columns represent the target datasets on which on-device training is conducted. The results indicate that the proposed method is not sensitive to the selection of the proxy dataset.
  • ...and 3 more figures