Bayesian Federated Learning with Hamiltonian Monte Carlo: Algorithm and Theory

Jiajun Liang; Qian Zhang; Wei Deng; Qifan Song; Guang Lin

TL;DR

This paper tackles uncertainty-aware Bayesian federated learning on non-iid data by proposing FA-HMC, a Federated Averaging approach built on stochastic gradient Hamiltonian Monte Carlo to sample from the global posterior π(θ) ∝ exp(−f(θ)) with $f(θ)=\sum_{c=1}^N w_c f^{(c)}(θ)$. It derives non-asymptotic Wasserstein-2 convergence guarantees under μ-strong convexity and Hessian smoothness, and shows how dimension $d$, gradient noise $σ_g$, momentum correlation $ρ$, and local update frequency influence convergence and communication costs; the analysis also establishes tightness via lower bounds. Empirically, FA-HMC outperforms FA-LD on simulated Bayesian logistic regression and real datasets (Fashion-MNIST, KMNIST, CIFAR-2), while achieving lower communication overhead and providing uncertainty quantification. The results suggest FA-HMC is robust to hyperparameters and suitable for privacy-conscious federated settings, with potential extensions to non-convex settings and heterogeneous local dynamics.

Abstract

This work introduces a novel and efficient Bayesian federated learning algorithm, the Federated Averaging stochastic Hamiltonian Monte Carlo (FA-HMC), for parameter estimation and uncertainty quantification. We establish rigorous convergence guarantees for FA-HMC on non-iid data sets under strong convexity and Hessian smoothness assumptions. Our analysis investigates how the dimension of the parameter space, the noise on gradients and momentum, and the frequency of communication between the central node and local nodes affect the convergence and communication costs of FA-HMC. We further establish the tightness of our analysis by showing that the convergence rate cannot be improved even for the continuous FA-HMC process. Moreover, extensive empirical studies demonstrate that FA-HMC outperforms the existing Federated Averaging-Langevin Monte Carlo (FA-LD) algorithm.
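To make the idea of averaging local HMC chains concrete, the following is a minimal sketch and not the paper's algorithm: the function names (`leapfrog`, `fa_hmc_sketch`), the choice of a momentum shared by all clients, exact (noise-free) gradients, weights that sum to one, and the toy Gaussian potentials $f^{(c)}(θ)=\tfrac12\|θ-m_c\|^2$ are all assumptions made for illustration. Each client runs `local_steps` HMC iterations of `n_leapfrog` leapfrog steps on its own potential, and the server then forms the weighted average of the local parameters.

```python
import numpy as np

def leapfrog(theta, p, grad, step, n_leapfrog):
    """Leapfrog integration of H(theta, p) = f(theta) + |p|^2 / 2."""
    theta, p = theta.copy(), p.copy()
    p -= 0.5 * step * grad(theta)              # initial half step for momentum
    for k in range(n_leapfrog):
        theta += step * p                      # full step for position
        scale = 1.0 if k < n_leapfrog - 1 else 0.5
        p -= scale * step * grad(theta)        # full step, half step at the end
    return theta, p

def fa_hmc_sketch(grads, weights, theta0, step, n_leapfrog, local_steps, rounds, rng):
    """Hypothetical federated-averaging HMC loop (illustrative assumptions only)."""
    theta_global = theta0.copy()
    samples = []
    for _ in range(rounds):
        # Every client starts the round from the current global parameter.
        local = [theta_global.copy() for _ in grads]
        for _ in range(local_steps):
            # Assumption: all clients share one freshly drawn momentum.
            p = rng.standard_normal(theta0.size)
            local = [leapfrog(th, p, g, step, n_leapfrog)[0]
                     for th, g in zip(local, grads)]
        # Communication step: weighted average of the local parameters.
        theta_global = sum(w * th for w, th in zip(weights, local))
        samples.append(theta_global.copy())
    return np.array(samples)

# Toy non-iid setup (assumed): client c has potential |theta - m_c|^2 / 2.
means = [np.array([-1.0, 0.0]), np.array([2.0, 1.0]), np.array([0.5, -2.0])]
weights = [0.2, 0.3, 0.5]
grads = [lambda th, m=m: th - m for m in means]

rng = np.random.default_rng(0)
samples = fa_hmc_sketch(grads, weights, np.zeros(2), step=0.1,
                        n_leapfrog=5, local_steps=4, rounds=4000, rng=rng)

# For this toy problem the global posterior is Gaussian with this mean.
target_mean = sum(w * m for w, m in zip(weights, means))
print("estimated mean:", samples[200:].mean(axis=0))
print("target mean:   ", target_mean)
```

In this toy case all clients have the same curvature, so the averaged chain coincides with unadjusted HMC on the global potential regardless of `local_steps`. With heterogeneous local curvature, more local steps between averaging rounds would generally introduce additional bias, which is the kind of trade-off between communication frequency and accuracy that the abstract refers to.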

Paper Structure (17 sections, 4 theorems, 17 equations, 7 figures, 3 algorithms)


Introduction
Roadmap:
Preliminary
Problem Setup
Hamilton's Equations and HMC
FA-HMC Algorithm and Assumptions
Assumptions
Theoretical Results
Main Results
Convergence Behaviour for FA-HMC Algorithm
Experiments
Simulation: FA-HMC vs FA-LD
Simulation: Dimension vs Communication for FA-HMC
Application: Logistic Regression Model for FMNIST
Application: Neural Network Model for FMNIST
...and 2 more sections

Key Result

Theorem 4.1

Assume assum:convex through assum:boundvar hold, ${\cal W}_2(\pi_0,\pi)^2=O(d)$, and $\sum_{c=1}^Nw_c\|\nabla f^{(c)}(\theta^*)\|^2=O(d)$. For a given local i... then ${\cal W}_2(\pi_{t_\epsilon},\pi)\leq \epsilon$ for any $\epsilon>0$, with iteration number a...

Notation: as $d\rightarrow \infty$, we say $f=O(g)$ if $f\leq C g$ for some constant $C$, and $f=\widetilde{O}(g)$ if $C$ is a polynomial of $\log(d)$.
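
For reference, the guarantee above is stated in the Wasserstein-2 distance. Its standard definition is given here as general background rather than quoted from the paper; the symbols $\mu$, $\nu$, $\gamma$ and $\Gamma(\mu,\nu)$ are notation introduced for this note, with $\Gamma(\mu,\nu)$ denoting the set of joint distributions whose marginals are $\mu$ and $\nu$:

$$
{\cal W}_2(\mu,\nu)^2=\inf_{\gamma\in\Gamma(\mu,\nu)}\int \|x-y\|^2\,d\gamma(x,y)
$$

Read this way, ${\cal W}_2(\pi_{t_\epsilon},\pi)\leq \epsilon$ says that a sample from the algorithm's distribution at iteration $t_\epsilon$ can be coupled with a sample from the target posterior $\pi$ so that their root-mean-square distance is at most $\epsilon$.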

Figures (7)

Figure 1: Experimental results of FA-HMC and FA-LD on the simulated dataset using exact gradients (G) and stochastic gradients (SG). Dimension $d=1000$ in Figure (a)-(c) and $d=10$ in Figure (d).
Figure 2: Experimental results of FA-HMC to achieve ${\cal W}_2<0.1$ at different dimensions $d$.
Figure 3: The impact of leapfrog steps $K$ on FA-HMC applied to the Fashion-MNIST dataset.
Figure 4: The impact of local steps $T$ on FA-HMC applied to the Fashion-MNIST dataset.
Figure 5: The impact of leapfrog steps $K$ and local steps $T$ on FA-HMC applied to train a two-hidden-layer neural network on the Fashion-MNIST dataset.
...and 2 more figures

Theorems & Definitions (5)

Theorem 4.1
Remark 4.2
Proposition 4.2
Theorem 4.3: Convergence
Proposition 4.3: Dynamic stepsize
