FedHB: Hierarchical Bayesian Federated Learning

Minyoung Kim, Timothy Hospedales

TL;DR

This work proposes a novel hierarchical Bayesian approach to Federated Learning (FL). The model describes the generative process of clients' local data via hierarchical Bayesian modeling: the clients' local models are random variables governed by a higher-level global variate.

Abstract

We propose a novel hierarchical Bayesian approach to Federated Learning (FL), in which our model describes the generative process of clients' local data via hierarchical Bayesian modeling: the clients' local models are random variables governed by a higher-level global variate. Interestingly, variational inference in our Bayesian model leads to an optimisation problem whose block-coordinate descent solution becomes a distributed algorithm that is separable over clients and does not require them to reveal their private data at all, and is thus fully compatible with FL. We also highlight that our block-coordinate algorithm has particular forms that subsume well-known FL algorithms, including Fed-Avg and Fed-Prox, as special cases. Beyond introducing novel modeling and derivations, we offer a convergence analysis showing that our block-coordinate FL algorithm converges to a (local) optimum of the objective at the rate of $O(1/\sqrt{t})$, the same rate as regular (centralised) SGD, as well as a generalisation error analysis in which we prove that the test error of our model on unseen data is guaranteed to vanish as the training data size increases, and is thus asymptotically optimal.
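For readers unfamiliar with the two baselines named in the abstract, the sketch below shows standard Fed-Avg and Fed-Prox on a toy least-squares problem. It is not the FedHB algorithm or its Bayesian derivation; it only illustrates the alternating local-update and server-aggregation pattern, and how Fed-Avg falls out of Fed-Prox when the proximal weight is zero. The function names, the synthetic data, and the hyperparameters (`mu`, `lr`, `steps`, `rounds`) are all assumptions made for illustration.

```python
import numpy as np

def local_update(w_global, X, y, mu=0.0, lr=0.1, steps=20):
    """Client step: gradient descent on the local least-squares loss plus the
    Fed-Prox proximal term (mu / 2) * ||w - w_global||^2.
    With mu = 0 this is the plain Fed-Avg local update."""
    w = w_global.copy()
    n = len(y)
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / n + mu * (w - w_global)
        w -= lr * grad
    return w

def federated_round(w_global, clients, mu=0.0):
    """Server step: average the client models, weighted by local data size.
    Only model weights are exchanged; the raw (X, y) stay with each client."""
    local_ws = [local_update(w_global, X, y, mu=mu) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    return np.average(local_ws, axis=0, weights=sizes)

def make_clients(num_clients=5, n=50, seed=0):
    """Hypothetical synthetic data: every client shares one true weight vector."""
    rng = np.random.default_rng(seed)
    w_true = np.array([1.0, -2.0, 0.5])
    clients = []
    for _ in range(num_clients):
        X = rng.normal(size=(n, w_true.size))
        y = X @ w_true + 0.01 * rng.normal(size=n)
        clients.append((X, y))
    return clients, w_true

def run(mu, rounds=30):
    """Run several federated rounds from a zero initialisation."""
    clients, w_true = make_clients()
    w = np.zeros_like(w_true)
    for _ in range(rounds):
        w = federated_round(w, clients, mu=mu)
    return w, w_true

if __name__ == "__main__":
    for mu in (0.0, 0.1):  # mu = 0.0 is Fed-Avg, mu > 0 is Fed-Prox
        w, w_true = run(mu)
        print(f"mu={mu}: max abs error = {np.max(np.abs(w - w_true)):.4f}")
```

The proximal term keeps each client's update close to the current global model, which is the role a prior centred on the global variate would play in a hierarchical view; the exact correspondence is established in the paper, not in this sketch.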

Paper Structure (37 sections, 4 theorems, 110 equations, 5 figures, 9 tables, 10 algorithms)

Key Result

Theorem 1

We denote the objective function in (eq:elbo) by $f(x)$, where $x = [x_0,x_1,\dots,x_N]$ corresponds to the variational parameters $x_0:=L_0$, $x_1:=L_1$, …, $x_N:=L_N$. Let $\eta_t = \overline{L} + \sqrt{t}$ for some constant $\overline{L}$, and $\overline{x}^T = \frac{1}{T}\sum_{t=1}^T x^t$, […] where $x^*$ is the (local) optimum, $D$ and $R_f$ are some constants, and the expectation is taken […]

Figures (5)

  • Figure 1: Graphical models. (a) Plate view of iid clients. (b) Individual client data with input images $x$ given and only $p(y|x)$ modeled. (c) & (d): Global prediction and personalisation as probabilistic inference problems (shaded nodes = evidence, red nodes = targets to infer, $x^*$ = test input in global prediction, $D^p$ = training data for personalisation, and $x^p$ = test input).
  • Figure 2: Hyperparameter sensitivity analysis and comparison with simple ensemble baselines.
  • Figure 3: CIFAR-100 training dynamics. (Left) Training curves over FL rounds. (Right) Personalisation training curves. Test accuracies are superimposed.
  • Figure 4: MNIST training convergence with different numbers of participating clients. (Left) NIW and (Right) Mixture ($K=2$).
  • Figure 5: Comparison between our mixture model and ensemble baselines ($K$ varied) on CIFAR-100.

Theorems & Definitions (8)

  • Theorem 1: Convergence analysis
  • Remark
  • Theorem 2: Generalisation error bound
  • Remark
  • Remark
  • Lemma 3
  • Remark
  • Lemma 4: From the proof of Lemma 4.1 in bai20