CAESAR: Enhancing Federated RL in Heterogeneous MDPs through Convergence-Aware Sampling with Screening

Hei Yi Mak; Flint Xiaofeng Fan; Luca A. Lanzendörfer; Cheston Tan; Wei Tsang Ooi; Roger Wattenhofer

TL;DR

The paper addresses Federated Reinforcement Learning (FedRL) in environments with heterogeneous MDPs, where naively averaging value functions across agents can hinder learning. It introduces CAESAR, a two-layer aggregation scheme that combines convergence-aware sampling with a screening step to selectively incorporate knowledge from peers operating in similar MDPs while prioritizing higher-performing agents. Empirical results on GridWorld and FrozenLake-v1 show that CAESAR outperforms the All and Sampling schemes, matches or exceeds the hypothetical Peers baseline, and remains robust across varying degrees of heterogeneity. The work advances practical FedRL by enabling efficient, environment-sensitive knowledge transfer without requiring prior knowledge of the number of MDPs or of agent-to-MDP assignments.

Abstract

In this study, we delve into Federated Reinforcement Learning (FedRL) in the context of value-based agents operating across diverse Markov Decision Processes (MDPs). Existing FedRL methods typically aggregate agents' learning by averaging the value functions across them to improve their performance. However, this aggregation strategy is suboptimal in heterogeneous environments where agents converge to diverse optimal value functions. To address this problem, we introduce the Convergence-AwarE SAmpling with scReening (CAESAR) aggregation scheme designed to enhance the learning of individual agents across varied MDPs. CAESAR is an aggregation strategy used by the server that combines convergence-aware sampling with a screening mechanism. By exploiting the fact that agents learning in identical MDPs are converging to the same optimal value function, CAESAR enables the selective assimilation of knowledge from more proficient counterparts, thereby significantly enhancing the overall learning efficiency. We empirically validate our hypothesis and demonstrate the effectiveness of CAESAR in enhancing the learning efficiency of agents, using both a custom-built GridWorld environment and the classical FrozenLake-v1 task, each presenting varying levels of environmental heterogeneity.

Paper Structure (15 sections, 15 equations, 9 figures, 1 algorithm)

Introduction
Preliminaries
Federated Reinforcement Learning with Heterogeneous Environments
Aggregation Schemes
Self
All
Peers
Sampling
CAESAR
Screen
Empirical Evaluation
Experimental Settings
Hypothesis Verification Using GridWorld
Effectiveness Evaluation Using FrozenLake-v1
Conclusion

Figures (9)

Figure 1: Two heterogeneous MDPs. MDP $M_1$ rewards $-1$ for action $0$ and $+1$ for action $1$, while MDP $M_2$ rewards $+1$ for action $0$ and $-1$ for action $1$. The optimal value functions are $Q_1(s_0,0)=-1, Q_1(s_0,1)=1$ for $M_1$, and $Q_2(s_0,0)=1, Q_2(s_0,1)=-1$ for $M_2$, respectively. Averaging these value functions results in $\bar{Q}(s_0,0) = \bar{Q}(s_0,1) = 0$, showing a misrepresentation of optimal values for both MDPs.
Figure 2: Two GridWorld MDPs. Both have initial state $0$. In MDP 1 (top), transitioning from state $4$ to $5$ yields a reward of $+1$ and transitioning from state $-4$ to $-5$ yields a reward of $-1$. In MDP 2 (bottom), the signs of the rewards are flipped.
Figure 3: FrozenLake-v1 environments generated by three different maps. The agent’s task is to navigate to the goal (the gift box) without falling into the holes.
Figure 4: Convergence of Q-values among peers in GridWorld under Self. Q-values of $M_1$ agents (blue) and $M_2$ agents (orange) converge to their respective optimal values (black dotted lines) for state-actions $(s=-4,a=\cdot)$ and $(s=-3,a=\cdot)$ in GridWorld. $\epsilon$ is set to $0.9$ to speed up convergence.
Figure 5: Average performance of the $N=20$ agents in GridWorld under different averaging schemes with exploration rate $\epsilon=0.1$. Results are averaged over independent runs with 30 random seeds; the shaded regions show the $95\%$ confidence intervals.
...and 4 more figures
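
The averaging failure described in the Figure 1 caption can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the Q-values stated in that caption. The helper names `average_q` and `greedy_actions` are hypothetical and are not taken from the paper, and the code illustrates naive averaging only, not CAESAR itself.

```python
# Optimal Q-values at state s0 as given in the Figure 1 caption,
# indexed by action (0, 1).
q_m1 = [-1.0, 1.0]   # MDP M_1: action 1 is optimal
q_m2 = [1.0, -1.0]   # MDP M_2: action 0 is optimal


def average_q(q_tables):
    """Element-wise mean of per-agent Q-value lists (naive aggregation).

    Hypothetical helper for illustration only.
    """
    n = len(q_tables)
    return [sum(values) / n for values in zip(*q_tables)]


def greedy_actions(q_values):
    """Return every action whose Q-value ties for the maximum."""
    best = max(q_values)
    return [a for a, q in enumerate(q_values) if q == best]


q_bar = average_q([q_m1, q_m2])

print(q_bar)                  # [0.0, 0.0]
print(greedy_actions(q_m1))   # [1]
print(greedy_actions(q_m2))   # [0]
print(greedy_actions(q_bar))  # [0, 1]
```

After averaging, both actions tie at zero, so the aggregated values no longer tell an agent in either MDP which action is better, even though each agent's own Q-values identify it unambiguously.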
