Data-Aware Gradient Compression for FL in Communication-Constrained Mobile Computing

Rongwei Lu, Yutong Jiang, Yinan Mao, Chen Tang, Bin Chen, Laizhong Cui, Zhi Wang

TL;DR

This work derives the convergence rate of distributed SGD with non-uniform compression, which reveals the intricate relationship between model convergence and the compression ratios applied to individual workers, and proposes DAGC-R, which assigns conservative compression to workers handling larger data volumes.

Abstract

Federated Learning (FL) in mobile environments faces significant communication bottlenecks. Gradient compression has proven to be an effective solution, offering substantial benefits in environments with limited bandwidth and metered data. Yet it suffers severe performance drops in non-IID environments because of its one-size-fits-all approach, which does not account for the varying data volumes across workers. Assigning different compression ratios to workers with distinct data distributions and volumes is therefore a promising solution. This work derives the convergence rate of distributed SGD with non-uniform compression, which reveals the intricate relationship between model convergence and the compression ratios applied to individual workers. Accordingly, we frame the relative compression ratio assignment as an $n$-variable chi-squared nonlinear optimization problem, constrained by a limited communication budget. We propose DAGC-R, which assigns conservative compression to workers handling larger data volumes. Recognizing the computational limitations of mobile devices, we also propose DAGC-A, which is computationally less demanding and enhances the robustness of compression in non-IID scenarios. Our experiments confirm that DAGC-R and DAGC-A can speed up training by up to $25.43\%$ and $16.65\%$, respectively, compared to uniform compression when data volumes are highly imbalanced and communication is restricted.
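
To make the idea of non-uniform compression concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a top-k sparsifier in which `delta` is the fraction of gradient coordinates a worker keeps (so a larger `delta` is more conservative), and it assumes the server weights each worker by its share of the data. The function names, the example gradients, and the ratios in `deltas` are all hypothetical; the actual DAGC-R and DAGC-A assignment rules are not reproduced here.

```python
import math

def top_k_sparsify(grad, delta):
    """Keep the ceil(delta * d) largest-magnitude entries of grad, zero the rest.

    Assumption: delta in (0, 1] is the fraction of coordinates retained,
    so a larger delta means more conservative compression.
    """
    d = len(grad)
    k = max(1, math.ceil(delta * d))
    # Indices of the k entries with the largest absolute value.
    keep = set(sorted(range(d), key=lambda j: abs(grad[j]), reverse=True)[:k])
    return [g if j in keep else 0.0 for j, g in enumerate(grad)]

def aggregate(grads, sizes, deltas):
    """Average compressed gradients, weighting worker i by its data share.

    Assumption: data-volume weighting is used for illustration only.
    """
    total = sum(sizes)
    d = len(grads[0])
    agg = [0.0] * d
    for grad, size, delta in zip(grads, sizes, deltas):
        compressed = top_k_sparsify(grad, delta)
        weight = size / total
        for j in range(d):
            agg[j] += weight * compressed[j]
    return agg

# Hypothetical setup: one large worker (900 samples) and one small worker (100).
grads = [[4.0, -1.0, 0.5, 3.0], [0.2, -2.0, 1.0, 0.1]]
sizes = [900, 100]
deltas = [0.5, 0.25]  # the large worker keeps more coordinates than the small one
update = aggregate(grads, sizes, deltas)
```

In this toy setup the large worker transmits two of its four coordinates and the small worker only one, so most of the communication budget goes to the worker whose gradient carries the most weight in the average.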

Paper Structure (28 sections, 13 theorems, 52 equations, 5 figures, 6 tables, 3 algorithms)

Key Result

Theorem 1

Consider a function $f$, which maps from $\mathbb{R}^d$ to $\mathbb{R}$, and is $L$-consistent. We can find a learning rate $\gamma$ such that $\gamma \leq \frac{1}{4LZ} \frac{\delta_{min}}{\sqrt{n C_Z}}$, where $C_Z = \sum_{i=1}^n\frac{\delta_{min}}{\delta_i}p_i^2$. This means that the number of iterations of non-uniform D-EF-SGD with the relative compressor ensures $\mathbb{E}f(\textbf{x}_{fina
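
The quantities named in the theorem statement can be evaluated numerically. The sketch below computes $C_Z$ and the upper bound on $\gamma$ exactly as written above. Two points are assumptions, because the excerpt does not define them: $p_i$ is read as worker $i$'s share of the data, and $Z$ is treated as a given positive constant. The function names and example values are hypothetical.

```python
import math

def c_z(deltas, p):
    """C_Z = sum_i (delta_min / delta_i) * p_i^2, as in the theorem statement."""
    delta_min = min(deltas)
    return sum(delta_min / d_i * p_i ** 2 for d_i, p_i in zip(deltas, p))

def step_size_bound(deltas, p, L, Z):
    """Upper bound on gamma: 1 / (4 L Z) * delta_min / sqrt(n * C_Z).

    Assumption: Z is supplied by the caller as a positive constant,
    since the excerpt does not define it.
    """
    n = len(deltas)
    delta_min = min(deltas)
    return (1.0 / (4.0 * L * Z)) * delta_min / math.sqrt(n * c_z(deltas, p))

# Uniform case: equal ratios and equal shares give C_Z = 1/n,
# so the bound reduces to delta / (4 L Z).
uniform_bound = step_size_bound([0.1] * 4, [0.25] * 4, L=1.0, Z=1.0)

# Two workers holding 90% and 10% of the data (hypothetical values).
p = [0.9, 0.1]
large_conservative = c_z([0.2, 0.05], p)  # larger delta on the large worker
large_aggressive = c_z([0.05, 0.2], p)    # larger delta on the small worker
```

With these values, giving the larger $\delta_i$ to the worker with the larger $p_i$ yields the smaller $C_Z$ and therefore a looser bound on $\gamma$, which is in line with the paper's choice to compress large workers conservatively.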

Figures (5)

  • Figure 1: High-level design of DAGC. DAGC assigns different compression ratios to workers depending on worker size. Large workers (i.e., workers with large data volumes; small and medium workers are defined similarly) are assigned conservative compression ratios, and small workers adopt aggressive compression ratios.
  • Figure 2: The accuracy curves (Accuracy vs. Iterations) of Logistic@FMNIST (a) and LSTM@SCs (b) using different relative compression strategies. In scheme I (respectively, scheme II), large workers are assigned lower (higher) compression ratios. Uniform compression is a one-size-fits-all strategy. Among these three strategies, non-uniform compression scheme I performs best.
  • Figure 3: The label distribution for Flickr (a) and training curves (Accuracy vs. Iterations) for VGG11s@Flickr under the relative compression ((b)-(d)) and the absolute compression ((e)-(g)) at different compression levels (left to right). DAGC outperforms the uniform compression strategies under a fixed, limited communication budget.
  • Figure 4: The training curves (Accuracy vs. Time) for VGG11s@Flickr under the relative compression (a) and the absolute compression (b). DAGC outperforms the other compression strategies as well as training without compression.
  • Figure 5: The training curves (Accuracy vs. Iterations) for ResNet18@CIFAR-10 and VGG11@CIFAR-100 under the relative compression (a, c) and the absolute compression (b, d). DAGC performs better in all cases.

Theorems & Definitions (17)

  • Theorem 1: Non-convex convergence rate of non-uniform D-EF-SGD with the relative compressor
  • Theorem 2: Convex convergence rate of non-uniform D-EF-SGD with the relative compressor, i.e., $\mu = 0$
  • Theorem 3: Strongly convex convergence rate of non-uniform D-EF-SGD with the relative compressor, i.e., $\mu > 0$
  • Theorem 4: Optimal $\delta_i$
  • Remark 1
  • Remark 2
  • Theorem 5: Non-convex convergence rate of non-uniform D-EF-SGD with the absolute compressor
  • Theorem 6: Convex convergence rate of non-uniform D-EF-SGD with the absolute compressor, i.e., $\mu = 0$
  • Theorem 7: Strongly convex convergence rate of non-uniform D-EF-SGD with the absolute compressor, i.e., $\mu > 0$
  • Theorem 8: Conversion from $\lambda$ to $\delta$ and optimal $\lambda_i$
  • ...and 7 more