Energy-Aware Heterogeneous Federated Learning via Approximate DNN Accelerators

Kilian Pfeiffer, Konstantinos Balaskas, Kostas Siozios, Jörg Henkel

TL;DR

This work addresses the challenge of heterogeneous energy resources in federated learning by introducing training-capable on-device accelerators tailored to each device’s energy budget. It combines compressed arithmetic formats and approximate computing within a hardware-aware energy model to substantially reduce training energy, up to about 4×, without sacrificing global model accuracy or fairness across devices. Unlike prior algorithmic approaches, the method designs the hardware at the device level to accommodate energy constraints while keeping full model capacity. The results demonstrate improved energy efficiency and fairness in FL, with practical implications for deploying privacy-preserving learning on resource-constrained edge devices.

Abstract

In Federated Learning (FL), devices that participate in the training usually have heterogeneous resources, i.e., energy availability. In current deployments of FL, devices that do not fulfill certain hardware requirements are often dropped from the collaborative training. However, dropping devices in FL can degrade training accuracy and introduce bias or unfairness. Several works have tackled this problem at the algorithmic level, e.g., by letting constrained devices train a subset of the server neural network (NN) model. However, it has been observed that these techniques are not effective w.r.t. accuracy. Importantly, they make simplistic assumptions about devices' resources via indirect metrics such as multiply-accumulate (MAC) operations or peak memory requirements. We observe that memory access costs, which these simplistic metrics do not capture, have a significant impact on energy consumption. In this work, for the first time, we consider on-device accelerator design for FL with heterogeneous devices. We utilize compressed arithmetic formats and approximate computing to satisfy limited energy budgets. Using a hardware-aware energy model, we observe that, in contrast to the moderate energy reduction achieved by the state of the art, our technique lowers the energy requirements (by 4x) while maintaining higher accuracy.
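
The abstract's point that MAC counts alone understate energy can be illustrated with a minimal sketch of a per-component energy estimate. This is not the paper's hardware-aware energy model: the names `EnergyCosts`, `step_energy_pj`, and `mac_only_energy_pj` are hypothetical, the per-operation energies are placeholder values chosen only for illustration, and the linear scaling of SRAM energy with word width is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnergyCosts:
    """Per-operation energy in picojoules. Placeholder values, NOT from the paper."""
    mac_pj: float = 4.0          # one FP32 multiply-accumulate (assumed)
    sram_word_pj: float = 5.0    # one 32-bit on-chip SRAM access (assumed)
    dram_word_pj: float = 640.0  # one 32-bit off-chip DRAM access (assumed)


def step_energy_pj(macs, sram_words, dram_words, sram_bits=32, costs=EnergyCosts()):
    """Estimate per-component energy for one training step, in picojoules."""
    # Assumption: SRAM access energy scales linearly with the stored word width,
    # so a compressed 16-bit format halves the SRAM term.
    sram_scale = sram_bits / 32.0
    breakdown = {
        "compute": macs * costs.mac_pj,
        "sram": sram_words * costs.sram_word_pj * sram_scale,
        "dram": dram_words * costs.dram_word_pj,
    }
    # The right-hand side is evaluated before "total" is inserted.
    breakdown["total"] = sum(breakdown.values())
    return breakdown


def mac_only_energy_pj(macs, costs=EnergyCosts()):
    """The indirect metric: energy attributed to MAC operations only."""
    return macs * costs.mac_pj
```

Calling `step_energy_pj(macs=1000, sram_words=2000, dram_words=10)` and comparing its `"total"` with `mac_only_energy_pj(1000)` shows how much of the estimate a MAC-only metric leaves out under these assumed costs; passing `sram_bits=16` shows the effect of a compressed SRAM format on the memory term alone.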

Paper Structure

This paper contains 14 sections, 8 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: We envision an FL system in which each FL device is equipped with a specific systolic array accelerator whose specifications are decided at design time. Depending on the devices' environments, accelerators with specific approximate processing elements and compressed SRAM are used. The full details of our accelerator design are visualized in Figure 2. The compression and approximation levels C1-C5 are listed in the paper's accelerator table.
  • Figure 2: Schematic overview of our designed training-capable accelerator, comprising a systolic array (SA), a SIMD array, on-chip SRAM buffers, and off-chip DRAM. MAC-based PEs (light green) within the SA contain an approximate mantissa multiplier, which dictates the compressed format for SRAM storage (purple). Gray components remain in FP32 format.
  • Figure 3: Schematic overview of our proposed approximate MAC unit (left). Each MAC unit in our SA is equipped with hardware approximation techniques at both the memory and computational levels (right). We utilize compressed arithmetic formats to fetch and store weights and input activations. Additionally, we employ the state-of-the-art MBM floating-point multiplier (Saadat et al., 2018), which leverages a linear version of the logarithmic multiplication property along with error correction for approximate mantissa multiplications.
  • Figure 4: Per-component energy breakdown for each accelerator configuration (C1-C5, left) and NN subset (S1-S4, right), according to our energy model for a single mini-batch training step with ResNet20 and $3\times32\times32$ input size. SRAM accesses and SA computations dominate the overall energy consumption. Color coding follows Figure 2.
  • Figure 5: Exemplary visualization of non-iid (left) and rc-non-iid (right) distributions for a total of 16 devices and 10 classes. Devices are divided equally into three groups, where group one uses C1 and groups two and three use CX (or SX in the case of the baselines), as described in our setup. The size of each circle represents the number of data samples of a specific class on a specific device. On the right, in contrast to the standard non-iid case on the left, there is a correlation between groups (i.e., device resources) and the quantity of class-specific data samples.
  • ...and 2 more figures
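
The Figure 3 caption refers to a linear version of the logarithmic multiplication property. A minimal software sketch of that underlying idea is Mitchell's classic logarithmic approximation, shown below. It is not the MBM multiplier used in the paper: it omits the error correction the caption mentions, it operates on Python floats rather than a hardware mantissa datapath, and the function name `mitchell_multiply` is hypothetical.

```python
import math


def mitchell_multiply(a: float, b: float) -> float:
    """Approximate a * b with Mitchell's logarithmic method (no error correction)."""
    if a == 0.0 or b == 0.0:
        return 0.0
    sign = -1.0 if (a < 0) != (b < 0) else 1.0

    # frexp returns (m, e) with 0.5 <= m < 1 and |x| = m * 2**e,
    # so |x| = (1 + f) * 2**(e - 1) with fraction f = 2*m - 1 in [0, 1).
    ma, ea = math.frexp(abs(a))
    mb, eb = math.frexp(abs(b))
    fa, fb = 2.0 * ma - 1.0, 2.0 * mb - 1.0
    exponent = (ea - 1) + (eb - 1)

    # Linear approximation log2(1 + f) ~= f turns the mantissa product
    # (1 + fa) * (1 + fb) into an addition of the fractions.
    frac_sum = fa + fb
    if frac_sum < 1.0:
        mantissa = 1.0 + frac_sum
    else:
        # The fraction sum carries into the exponent.
        mantissa = frac_sum
        exponent += 1
    return sign * math.ldexp(mantissa, exponent)
```

Without correction, this approximation never overestimates the magnitude of the product, and its relative error is largest (one ninth, about 11.1%) when both fractions equal 0.5, for example `mitchell_multiply(3.0, 3.0)` yields 8.0 instead of 9.0. Error-corrected designs such as the one cited in the caption aim to reduce this bias.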