PolyThrottle: Energy-efficient Neural Network Inference on Edge Devices

Minghao Yan, Hongyi Wang, Shivaram Venkataraman

TL;DR

PolyThrottle makes edge inference more energy-efficient by jointly tuning CPU, GPU, and memory frequencies together with batch size under latency SLOs, using Constrained Bayesian Optimization. It reveals that memory frequency and minimum GPU frequency can dominate energy use, and it introduces a workload-interference model to schedule on-device fine-tuning without violating SLOs, achieving up to 36% energy savings with minimal online overhead. The framework combines an offline search for near-optimal configurations with an online predictor and scheduler that run fine-tuning concurrently with inference. Implemented on NVIDIA Jetson TX2 and Jetson Orin with EfficientNet and BERT workloads, PolyThrottle demonstrates practical energy reductions and fast convergence across diverse models and hardware platforms.

Abstract

As neural networks (NN) are deployed across diverse sectors, their energy demand correspondingly grows. While several prior works have focused on reducing energy consumption during training, the continuous operation of ML-powered systems leads to significant energy use during inference. This paper investigates how the configuration of on-device hardware elements such as GPU, memory, and CPU frequencies, often neglected in prior studies, affects energy consumption for NN inference with regular fine-tuning. We propose PolyThrottle, a solution that optimizes configurations across individual hardware components using Constrained Bayesian Optimization in an energy-conserving manner. Our empirical evaluation uncovers novel facets of the energy-performance equilibrium, showing that we can save up to 36 percent of energy for popular models. We also validate that PolyThrottle can quickly converge towards near-optimal settings while satisfying application constraints.
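
One way to read the search problem described above is as a constrained minimization. The notation below is an assumption introduced for illustration and is not taken from the paper: $f_{\mathrm{cpu}}$, $f_{\mathrm{gpu}}$, and $f_{\mathrm{mem}}$ denote the component frequencies, $b$ the batch size, $E$ the per-query energy, $L$ the inference latency, and $L_{\mathrm{SLO}}$ the latency target set by the application.

```latex
\begin{aligned}
\min_{f_{\mathrm{cpu}},\, f_{\mathrm{gpu}},\, f_{\mathrm{mem}},\, b} \quad
  & E(f_{\mathrm{cpu}}, f_{\mathrm{gpu}}, f_{\mathrm{mem}}, b) \\
\text{subject to} \quad
  & L(f_{\mathrm{cpu}}, f_{\mathrm{gpu}}, f_{\mathrm{mem}}, b) \le L_{\mathrm{SLO}}
\end{aligned}
```

Under this reading, both $E$ and $L$ are unknown functions that can only be evaluated by running the model on the device, which is the kind of setting where a sample-efficient method such as Constrained Bayesian Optimization is typically applied.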

Paper Structure

This paper contains 21 sections, 3 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Overall workflow of PolyThrottle. The optimizer first identifies the optimal hardware configuration for a given model. When new data arrives, the inference server handles the inference requests. Upon receiving a fine-tuning request, the performance predictor estimates whether time-sharing the inference and fine-tuning workloads would result in SLO violations. The predictor then searches for feasible adjustments that meet the SLO constraints. If such adjustments are identified, the system applies them and schedules fine-tuning requests until completion.
  • Figure 2: Left: Pareto frontier of the energy vs. latency tradeoff for various batch sizes for EfficientNet B7 on Jetson Orin. Right: Pareto frontier of the energy vs. latency tradeoff for various batch sizes for EfficientNet B4 on Jetson TX2. Each data point represents a unique hardware configuration, and each line corresponds to a batch size. The tradeoff does not always follow the same pattern across hardware platforms and models.
  • Figure 3: Left: per-query energy cost as the GPU frequency and memory frequency vary for EfficientNet B4 on Jetson TX2, with batch size fixed at 1. Right: per-query energy cost as the minimum and maximum GPU frequencies vary. As the minimum GPU frequency increases, energy cost decreases.
  • Figure 4: Comparison of search efficiency between Constrained Bayesian Optimization and Random Search. The y-axis shows the number of attempts needed to find a near-optimal configuration, and the x-axis shows the deployed model and its associated quantization level. The first row corresponds to the setting with a latency target and the batch size restricted to 1. The second row corresponds to the setting where the latency constraint is relaxed and inference requests may be batched.
  • Figure 5: Left: per-query energy cost as the GPU frequency and memory frequency vary for BERT at FP16 on Jetson TX2, with batch size fixed at 1. Right: per-query energy cost as the GPU frequency and memory frequency vary for BERT at FP32 on Jetson TX2, with batch size fixed at 1.
  • ...and 4 more figures
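
The Pareto frontiers in Figure 2 and the latency-constrained setting in Figure 4 can be illustrated with a minimal Python sketch. It assumes each hardware configuration has already been measured as a (latency, energy) pair. The function names, units, and sample values are hypothetical and are not taken from the paper or its implementation.

```python
from typing import List, Optional, Tuple

# Hypothetical measurement: (latency_ms, energy_mj), both to be minimized.
Point = Tuple[float, float]


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated (latency, energy) points, sorted by latency."""
    frontier: List[Point] = []
    best_energy = float("inf")
    # After sorting by latency, a point is kept only if it uses strictly
    # less energy than every point with lower or equal latency.
    for latency, energy in sorted(points):
        if energy < best_energy:
            frontier.append((latency, energy))
            best_energy = energy
    return frontier


def best_under_slo(points: List[Point], slo_ms: float) -> Optional[Point]:
    """Return the lowest-energy point whose latency meets the SLO, if any."""
    feasible = [p for p in points if p[0] <= slo_ms]
    if not feasible:
        return None
    return min(feasible, key=lambda p: p[1])


# Made-up values for illustration only; these are not results from the paper.
measurements = [(10.0, 5.0), (12.0, 4.0), (11.0, 6.0), (15.0, 4.5), (20.0, 3.0)]
frontier = pareto_frontier(measurements)
choice = best_under_slo(measurements, slo_ms=13.0)
```

With these sample values, the frontier keeps three of the five configurations, and the configuration selected under a 13 ms latency target is the one at 12 ms. An exhaustive scan like this is only practical when every configuration has already been measured; the search methods compared in Figure 4 aim to reach a near-optimal configuration with far fewer measurements.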