Federated Distributional Reinforcement Learning with Distributional Critic Regularization

David Millard; Cecilia Alm; Rashid Ali; Pengcheng Shi; Ali Baheri

Abstract

Federated reinforcement learning typically aggregates value functions or policies by parameter averaging, which emphasizes expected return and can obscure the multimodality and tail behavior that matter in safety-critical settings. We formalize federated distributional reinforcement learning (FedDistRL), in which clients parametrize quantile value-function critics and federate only these networks. We also propose TR-FedDistRL, which builds a per-client, risk-aware Wasserstein barycenter over a temporal buffer. This local barycenter provides a reference region that constrains the parameter-averaged critic, so that necessary distributional information is not averaged out during federation. The distributional trust region is implemented as a shrink-squash step around this reference. Under fixed-policy evaluation, the feasibility map is nonexpansive and the update is contractive in a probe-set Wasserstein metric. Experiments on a bandit, a multi-agent gridworld, and a continuous highway environment show reduced mean-smearing, improved safety proxies (catastrophe/accident rate), and lower critic/policy drift than mean-oriented and non-federated baselines.
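To make the contrast between mean-oriented aggregation and a risk-aware barycenter concrete, the following is a minimal sketch on one-dimensional quantile atoms. It rests on several assumptions that are not taken from the paper: each client is represented by K equally weighted quantile atoms, index-wise arithmetic averaging of atoms stands in for the output-space effect of parameter averaging, the softmax weighting in `risk_weights` (including its `temperature` parameter) is a hypothetical choice, and the barycenter is computed as a pointwise lower weighted median, which is one (non-unique) minimizer of the weighted sum of Wasserstein-1 distances in one dimension.

```python
import math
from typing import List, Sequence


def lower_tail_cvar(atoms: Sequence[float], alpha: float = 0.1) -> float:
    """Average of the worst ceil(alpha * K) equally weighted quantile atoms."""
    ordered = sorted(atoms)
    k = max(1, math.ceil(alpha * len(ordered)))
    return sum(ordered[:k]) / k


def risk_weights(clients: Sequence[Sequence[float]], alpha: float = 0.1,
                 temperature: float = 1.0) -> List[float]:
    """Hypothetical softmax weights: lower CVaR (more tail risk) gets more weight."""
    cvars = [lower_tail_cvar(c, alpha) for c in clients]
    worst = min(cvars)
    scores = [math.exp(-(c - worst) / temperature) for c in cvars]
    total = sum(scores)
    return [s / total for s in scores]


def quantile_average(clients: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Index-wise arithmetic average of sorted atoms (mean-oriented proxy)."""
    ordered = [sorted(c) for c in clients]
    return [sum(w * c[k] for w, c in zip(weights, ordered))
            for k in range(len(ordered[0]))]


def w1_barycenter(clients: Sequence[Sequence[float]],
                  weights: Sequence[float]) -> List[float]:
    """Pointwise lower weighted median of sorted atoms (a 1-D W1 barycenter)."""
    ordered = [sorted(c) for c in clients]
    half = 0.5 * sum(weights)
    out = []
    for k in range(len(ordered[0])):
        pairs = sorted(zip((c[k] for c in ordered), weights))
        running = 0.0
        for value, weight in pairs:
            running += weight
            if running >= half:  # first atom whose cumulative weight reaches half
                out.append(value)
                break
    return out


if __name__ == "__main__":
    # Toy atoms, purely illustrative: one bimodal, high-risk client and two benign ones.
    clients = [[-10, -9, 9, 10], [-1, 0, 0, 1], [-1, 0, 0, 1]]
    uniform = [1 / 3] * 3
    print("average:", quantile_average(clients, uniform))
    print("W1 barycenter (uniform):", w1_barycenter(clients, uniform))
    print("W1 barycenter (risk-weighted):",
          w1_barycenter(clients, risk_weights(clients, alpha=0.1)))
```

In this toy case the arithmetic average produces atoms that no client actually has, while the weighted median returns atoms drawn from the clients themselves, and the risk weighting shifts the result toward the client with the heaviest lower tail.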

Paper Structure (25 sections, 4 theorems, 34 equations, 4 figures, 2 tables)

INTRODUCTION
RELATED WORKS
Federated Reinforcement Learning
Risk-Averse Reinforcement Learning
Trust Regions & Reinforcement Learning
Optimal Transport & Reinforcement Learning
PRELIMINARIES & PROBLEM FORMULATION
Distributional Reinforcement Learning
Wasserstein Barycenter
Problem Formulation
METHODOLOGY
Local Updates
Aggregation
Trust Region
Initialization
...and 10 more sections

Key Result

Lemma 1

For each round $n$, assume the tube squash $\Phi^{(n)}(\cdot;\bar{q})$ acts coordinatewise and satisfies: (i) anchoring, $\Phi^{(n)}(\bar{q};\bar{q})=\bar{q}$; and (ii) coordinatewise Lipschitz continuity: for all $x,y\in\mathbb{R}^K$ and all $k$, $|(\Phi^{(n)}(x;\bar{q}))_k-(\Phi^{(n)}(y;\bar{q}))_k| \le \beta_k \ldots$
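As an illustration of a map with these two properties, the sketch below uses a tanh-based squash. This is a hypothetical instance and not necessarily the paper's $\Phi^{(n)}$: the function name `tube_squash`, the per-coordinate `radius`, and the tanh form are all assumptions. It also assumes condition (ii) is the usual coordinatewise bound $|(\Phi(x))_k-(\Phi(y))_k| \le \beta_k |x_k-y_k|$. Because $|\tanh'| \le 1$, the map $x_k \mapsto \bar{q}_k + r_k \tanh(\beta_k (x_k-\bar{q}_k)/r_k)$ fixes $\bar{q}$, is $\beta_k$-Lipschitz in coordinate $k$, and never leaves the tube of radius $r_k$ around $\bar{q}_k$.

```python
import math
from typing import List, Sequence


def tube_squash(x: Sequence[float], q_bar: Sequence[float],
                radius: Sequence[float], beta: Sequence[float]) -> List[float]:
    """Hypothetical coordinatewise squash toward a reference quantile vector.

    Each coordinate is mapped to q_bar_k + r_k * tanh(beta_k * (x_k - q_bar_k) / r_k),
    with r_k > 0. This keeps the output within r_k of q_bar_k.
    """
    if any(r <= 0 for r in radius):
        raise ValueError("every tube radius must be positive")
    return [qb + r * math.tanh(b * (xi - qb) / r)
            for xi, qb, r, b in zip(x, q_bar, radius, beta)]


if __name__ == "__main__":
    q_bar = [0.0, 1.0, 2.0]          # reference quantiles (illustrative values)
    radius = [1.0, 1.0, 1.0]         # tube half-widths (assumed)
    beta = [0.5, 0.5, 0.5]           # per-coordinate Lipschitz constants (assumed)
    averaged = [5.0, -3.0, 2.5]      # stand-in for a parameter-averaged critic output
    print(tube_squash(averaged, q_bar, radius, beta))
```

Choosing every $\beta_k < 1$ makes the squash a strict contraction toward the reference in each coordinate, which is the shrinking behavior a trust-region step would want; values of $\beta_k$ equal to 1 give only nonexpansiveness.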

Figures (4)

Figure 1: Each panel shows how aggregation operators combine scalar, multimodal client reward distributions in the bandit setting (i.e., $Z^\pi=r_t$, $\gamma=0$). Each client is parameterized with QR-DQN models and averaging is computed via Equation \ref{eq:param_avg}. Left: critic parameter averaging (FedAvg) induces an output-space arithmetic averaging effect that can smear modes and attenuate tail mass (mean-smearing). Middle: an unweighted Wasserstein-1 barycenter aggregates in distribution space and better preserves geometric structure (modes/tails) compared with arithmetic averaging. Right: a CVaR-risk-weighted Wasserstein-1 barycenter biases the barycenter toward lower-tail behavior by upweighting clients with larger lower-tail risk (CVaR at level $\alpha = 0.1$), helping retain tail-relevant mass while still preserving multimodality.
Figure 2: Case Study 2 client environments for Clients 1--3. Each panel shows one client’s environment with obstacles, hazard zone, agent start/goal locations, and object start/goal locations. The three client environments are arranged for direct side-by-side comparison of environmental heterogeneity. All instantiations use a $10\times10$ grid with 3 agents and 3 objects.
Figure 3: Bars report the percent change in catastrophe rate for each heterogeneous client relative to the Local baseline (baseline corresponds to 0% by definition), aggregated over 30 random seeds; negative values indicate fewer catastrophes (safer) than Local. Catastrophe rate is the fraction of evaluation episodes that terminate in a designated catastrophic event (environment-defined unsafe terminal condition). Error bars denote variability across seeds (standard deviation).
Figure 4: Accident rate across training steps for each heterogeneous client in the highway environment (Case Study 3), comparing Local, FedAvg, FedAvg (CVaR), and FedAvg (CVaR + TR) over 30 random seeds. Shaded regions denote standard deviation across seeds. Lower accident rate indicates safer behavior. FedAvg (CVaR + TR) consistently achieves the lowest accident rate across clients, demonstrating that the proposed risk-aware trust-region regularization improves safety in a continuous-control setting under environment heterogeneity.
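The safety metric reported in Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' evaluation code: the function names are hypothetical, the percent change is assumed to be computed per seed against the paired Local run and then summarized, and the error bar is assumed to be the sample standard deviation across seeds. The numbers in the usage lines are toy inputs, not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def catastrophe_rate(episode_flags: Sequence[bool]) -> float:
    """Fraction of evaluation episodes that ended in a catastrophic event."""
    return sum(1 for flag in episode_flags if flag) / len(episode_flags)


def percent_change_vs_local(method_rates: Sequence[float],
                            local_rates: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample std of per-seed percent change relative to Local.

    Negative values mean fewer catastrophes than the Local baseline.
    """
    if len(method_rates) != len(local_rates):
        raise ValueError("need one method rate per Local rate (paired seeds)")
    changes = []
    for method, local in zip(method_rates, local_rates):
        if local == 0:
            raise ValueError("percent change is undefined for a zero baseline")
        changes.append(100.0 * (method - local) / local)
    spread = stdev(changes) if len(changes) > 1 else 0.0
    return mean(changes), spread


if __name__ == "__main__":
    # Toy per-seed catastrophe rates for one client (illustrative only).
    local = [0.2, 0.4]
    method = [0.1, 0.3]
    print(percent_change_vs_local(method, local))
    print(percent_change_vs_local(local, local))  # baseline against itself
```

Comparing the baseline with itself yields a zero change by construction, matching the caption's convention that Local corresponds to 0%.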

Theorems & Definitions (10)

Definition 1: Empirical-measure lift and induced metric
Definition 2: FedDistRL with risk-aware reference
Remark 1
Remark 2
Lemma 1: Tube squash: anchoring and Lipschitz in $d$
Lemma 2: Tube-cap bound in $d$
Theorem 1: Stability around a moving reference
Proof
Corollary 1: Geometric tracking bound
Remark 3
