BAFFLE: A Baseline of Backpropagation-Free Federated Learning

Haozhe Feng, Tianyu Pang, Chao Du, Wei Chen, Shuicheng Yan, Min Lin

TL;DR

The paper tackles the practical bottleneck of backpropagation in federated learning on edge devices by introducing BAFFLE, a backpropagation-free FL framework that estimates gradients with zero-order methods using only forward passes. BAFFLE applies Gaussian perturbations to the global model and uses Stein's identity to obtain an unbiased gradient estimate; each client communicates a vector of loss differences, which the server combines through secure aggregation. The authors provide convergence guarantees showing that the estimator is unbiased, with a rate of $\mathcal{O}(\sqrt{n/K})$ in the number of perturbations $K$, and demonstrate empirical viability on MNIST, CIFAR-10/100, and OfficeHome, along with memory and bandwidth efficiency, TEE compatibility, and robustness to inference attacks. BAFFLE incurs some accuracy trade-offs, especially under severe non-IID distributions, but it offers a promising path for privacy-preserving, resource-constrained FL where backpropagation is impractical.
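
As a rough guide to what "zero-order gradient estimation from forward passes" means here, the block below writes out the standard central-difference estimator that follows from Stein's identity for Gaussian perturbations. The symbols match those used in the figure captions ($\mathbf{W}$, $\bm{\delta}_k$, $\sigma$, $K$, $\Delta\mathcal{L}$); the exact normalisation constant is an assumption based on the textbook form and may differ in detail from the paper's own equation.

```latex
% Stein's identity for \bm{\delta} \sim \mathcal{N}(0, \sigma^2 \mathbf{I}):
\nabla_{\mathbf{W}}\, \mathbb{E}_{\bm{\delta}}\!\left[\mathcal{L}(\mathbf{W}+\bm{\delta};\mathbb{D})\right]
  = \frac{1}{\sigma^{2}}\, \mathbb{E}_{\bm{\delta}}\!\left[\bm{\delta}\, \mathcal{L}(\mathbf{W}+\bm{\delta};\mathbb{D})\right]

% Loss difference from two forward passes (central difference):
\Delta\mathcal{L}(\mathbf{W},\bm{\delta};\mathbb{D})
  := \mathcal{L}(\mathbf{W}+\bm{\delta};\mathbb{D}) - \mathcal{L}(\mathbf{W}-\bm{\delta};\mathbb{D})

% Monte Carlo estimate from K perturbations (assumed normalisation):
\widehat{\nabla}_{\mathbf{W}}\mathcal{L}(\mathbf{W};\mathbb{D})
  := \frac{1}{2\sigma^{2}K} \sum_{k=1}^{K} \bm{\delta}_{k}\, \Delta\mathcal{L}(\mathbf{W},\bm{\delta}_{k};\mathbb{D})
```

Each term in the sum needs only the scalar $\Delta\mathcal{L}$ from the client, which is why the upload consists of $K$ floating-point numbers rather than a full gradient.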

Abstract

Federated learning (FL) is a general principle for decentralized clients to train a server model collectively without sharing local data. FL is a promising framework with practical applications, but its standard training paradigm requires the clients to backpropagate through the model to compute gradients. Since these clients are typically edge devices and not fully trusted, executing backpropagation on them incurs computational and storage overhead as well as white-box vulnerability. In light of this, we develop backpropagation-free federated learning, dubbed BAFFLE, in which backpropagation is replaced by multiple forward processes to estimate gradients. BAFFLE is 1) memory-efficient and easily fits uploading bandwidth; 2) compatible with inference-only hardware optimization and model quantization or pruning; and 3) well-suited to trusted execution environments, because the clients in BAFFLE only execute forward propagation and return a set of scalars to the server. Empirically, we use BAFFLE to train deep models from scratch or to finetune pretrained models, achieving acceptable results. Code is available at https://github.com/FengHZ/BAFFLE.

Paper Structure (19 sections, 2 theorems, 15 equations, 5 figures, 4 tables)

This paper contains 19 sections, 2 theorems, 15 equations, 5 figures, 4 tables.

Key Result

Theorem

(Proof in Appendix A) Suppose $\sigma$ is a small value and the central difference scheme in Eq. (eq:fdcenter) holds. For perturbations $\{\bm{\delta}_k\}_{k=1}^{K}\overset{\mathrm{iid}}{\sim} \mathcal{N}(0,\sigma^2{\mathbf{I}})$, the empirical covariance matrix is $\widehat{\mathbf{\Sigma}}:=\ldots$ where $\mathbb{E}[\widehat{\mathbf{\Sigma}}]={\mathbf{I}}$, $\mathbb{E}[\widehat{\bm{\delta}}]\ldots$
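
The property $\mathbb{E}[\widehat{\mathbf{\Sigma}}]=\mathbf{I}$ for Gaussian perturbations can be checked numerically. The sketch below is a toy illustration, not the authors' code: it assumes the normalisation $\widehat{\mathbf{\Sigma}} = \frac{1}{K\sigma^2}\sum_k \bm{\delta}_k\bm{\delta}_k^{\top}$ (chosen so that its expectation is the identity), the textbook central-difference estimator, and a hypothetical quadratic loss `loss_fn`. All function names are made up for this example.

```python
import numpy as np

def empirical_covariance(deltas, sigma):
    """Assumed normalisation: (1 / (K * sigma^2)) * sum_k delta_k delta_k^T."""
    K = deltas.shape[0]
    return deltas.T @ deltas / (K * sigma**2)

def zero_order_gradient(loss_fn, w, deltas, sigma):
    """Central-difference estimate: (1 / (2 K sigma^2)) * sum_k delta_k * dL_k."""
    K = deltas.shape[0]
    # Two forward passes per perturbation, no backpropagation.
    loss_diffs = np.array([loss_fn(w + d) - loss_fn(w - d) for d in deltas])
    return deltas.T @ loss_diffs / (2.0 * K * sigma**2)

rng = np.random.default_rng(0)
n, K, sigma = 5, 5000, 1e-2

# Hypothetical toy loss L(w) = 0.5 * w^T A w, whose exact gradient is A w.
A = np.diag(np.arange(1.0, n + 1.0))
loss_fn = lambda w: 0.5 * w @ A @ w
w = np.ones(n)
true_grad = A @ w

deltas = rng.normal(0.0, sigma, size=(K, n))
cov = empirical_covariance(deltas, sigma)
est_grad = zero_order_gradient(loss_fn, w, deltas, sigma)

rel_error = np.linalg.norm(est_grad - true_grad) / np.linalg.norm(true_grad)
print("max |cov - I|:", np.abs(cov - np.eye(n)).max())
print("relative gradient error:", rel_error)
```

For this quadratic loss the central difference is exact, so the estimate equals the empirical covariance applied to the true gradient; the remaining error comes only from the finite number of perturbations and shrinks as $K$ grows relative to $n$.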

Figures (5)

  • Figure 1: A sketch map of BAFFLE. In addition to the global parameter update $\Delta{\mathbf{W}}$, each client downloads random seeds to locally generate perturbations $\pm\bm{\delta}_{1:K}$ and performs $2K$ forward passes (i.e., inference) to compute loss differences. The server can recover these perturbations using the same random seeds and obtain $\Delta\mathcal{L}({\mathbf{W}},\bm{\delta}_k)$ by secure aggregation. Each loss difference $\Delta\mathcal{L}({\mathbf{W}},\bm{\delta}_{k};\mathbb{D}_{c})$ is a floating-point number, so $K$ can be easily adjusted to fit the uploading bandwidth.
  • Figure 2: The classification accuracy (%) of BAFFLE in iid scenarios ($C=10$) and batch-level communication settings with various $K$ values. We treat the models trained by exact gradients on conventional FL systems as the backpropagation (BP) baselines. On different datasets and architectures, our BAFFLE achieves comparable performance to the exact gradient results with a reasonable $K$.
  • Figure 3: The ablation study of BAFFLE guidelines, with $K=100$ on MNIST and $K=500$ on CIFAR-10. As seen, twice-FD, Hardswish, and EMA all improve performance without extra computation. EMA reduces oscillations by lessening Gaussian noise.
  • Figure 4: A sketch map to run BAFFLE in one trusted execution environment. The pipeline contains three steps: (1) Load the data and model into the security storage. (2) Load the code of BAFFLE into the root of trust. (3) Run the BAFFLE program in a separation kernel.
  • Figure 5: The robustness of BAFFLE to inference attacks. For real data, we randomly sample some input-label pairs from the validation dataset. For random noise, we generate input-label pairs from standard normal distribution. We sample $500$ perturbations $\bm{\delta}$ from $\mathcal{N}(0,\sigma^2{\mathbf{I}})$, collect the values of $\Delta \mathcal{L}({\mathbf{W}},\bm{\delta};{\mathbb{D}})$ for real data and random noise separately, and compare their distributions.
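
The seed-sharing protocol described in the Figure 1 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: one shared seed per round, a plain size-weighted mean standing in for secure aggregation, the textbook central-difference estimator, and a hypothetical linear-regression loss `mse_loss`. All function names and hyperparameters are invented for the example.

```python
import numpy as np

def perturbations_from_seed(seed, K, n, sigma):
    """Client and server both regenerate the same K perturbations from a seed."""
    rng = np.random.default_rng(seed)
    return rng.normal(0.0, sigma, size=(K, n))

def client_loss_differences(loss_fn, w, data, seed, K, sigma):
    """Client side: 2K forward passes, returns K scalars (one per perturbation)."""
    deltas = perturbations_from_seed(seed, K, w.size, sigma)
    return np.array([loss_fn(w + d, data) - loss_fn(w - d, data) for d in deltas])

def server_update(w, client_diffs, client_sizes, seed, K, sigma, lr):
    """Server side: recover perturbations, aggregate scalars, take one step."""
    deltas = perturbations_from_seed(seed, K, w.size, sigma)
    weights = np.asarray(client_sizes, dtype=float) / np.sum(client_sizes)
    # Assumption: a weighted mean stands in for secure aggregation here.
    aggregated = weights @ np.stack(client_diffs)
    grad_est = deltas.T @ aggregated / (2.0 * K * sigma**2)
    return w - lr * grad_est

def mse_loss(w, data):
    """Hypothetical toy model: linear regression with mean squared error."""
    X, y = data
    return float(np.mean((X @ w - y) ** 2))

rng = np.random.default_rng(0)
n, K, sigma, lr = 4, 200, 1e-3, 0.05
w_true = rng.normal(size=n)
clients = []
for _ in range(3):
    X = rng.normal(size=(50, n))
    clients.append((X, X @ w_true))
sizes = [len(y) for _, y in clients]

w = np.zeros(n)
initial = np.mean([mse_loss(w, d) for d in clients])
for round_idx in range(60):
    seed = 1000 + round_idx  # a fresh shared seed each round
    diffs = [client_loss_differences(mse_loss, w, d, seed, K, sigma) for d in clients]
    w = server_update(w, diffs, sizes, seed, K, sigma, lr)
final = np.mean([mse_loss(w, d) for d in clients])
print("loss before:", initial, "loss after:", final)
```

Each client uploads only `K` floats per round regardless of model size, so lowering `K` trades gradient accuracy for bandwidth, as the caption notes.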

Theorems & Definitions (2)

  • Theorem
  • Theorem