Efficient Reinforcement Learning for Global Decision Making in the Presence of Local Agents at Scale

Emile Anand; Guannan Qu


TL;DR

This work proposes the SUBSAMPLE-Q algorithm, in which the global agent subsamples $k \leq n$ local agents to compute a policy in time that is polynomial in $k$, and shows that the learned policy converges to the optimal policy at a rate of $\tilde{O}(1/\sqrt{k}+\epsilon_{k,m})$ as the number of subsampled agents $k$ increases.

Abstract

We study reinforcement learning for global decision-making in the presence of local agents, where the global decision-maker makes decisions affecting all local agents, and the objective is to learn a policy that maximizes the joint rewards of all the agents. Such problems find many applications, e.g., demand response, EV charging, and queueing. In this setting, scalability has been a long-standing challenge because the size of the state space can be exponential in the number of agents. This work proposes the \texttt{SUBSAMPLE-Q} algorithm, in which the global agent subsamples $k\leq n$ local agents to compute a policy in time that is polynomial in $k$. We show that the learned policy converges to the optimal policy at a rate of $\tilde{O}(1/\sqrt{k}+\epsilon_{k,m})$ as the number of subsampled agents $k$ increases, where $\epsilon_{k,m}$ is the Bellman noise. Finally, we validate the theory through numerical simulations in a demand-response setting and a queueing setting.
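To make the scalability claim concrete, the following is a minimal sketch and not the paper's implementation. It assumes each local agent has a small finite state set and that the global agent summarizes the $k$ sampled agents by their empirical state distribution, an assumption suggested by Definition 3.1 in the outline. The function names (`joint_state_count`, `empirical_distribution_count`, `subsample_empirical_distribution`) are hypothetical.

```python
import random
from collections import Counter
from math import comb


def joint_state_count(num_local_states, n):
    """Joint configurations of n distinguishable local agents: |S_l|^n."""
    return num_local_states ** n


def empirical_distribution_count(num_local_states, k):
    """Distinct empirical distributions of k agents (stars and bars).

    For a fixed number of local states this grows polynomially in k.
    """
    return comb(k + num_local_states - 1, num_local_states - 1)


def subsample_empirical_distribution(local_states, k, num_local_states, rng):
    """Sample k agents uniformly without replacement and return the
    fraction of sampled agents in each local state."""
    chosen = rng.sample(local_states, k)
    counts = Counter(chosen)
    return tuple(counts[s] / k for s in range(num_local_states))


# Illustrative setup (assumed values): n = 50 agents, 3 local states, k = 8.
rng = random.Random(0)
local_states = [rng.randrange(3) for _ in range(50)]
dist = subsample_empirical_distribution(local_states, 8, 3, rng)
print(joint_state_count(3, 50), empirical_distribution_count(3, 8), dist)
```

Under these assumptions, the table a learner would have to fill is indexed by the empirical distribution of the $k$ sampled agents rather than by the full joint state, which is where the exponential-to-polynomial reduction in this sketch comes from.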


Paper Structure (19 sections, 35 theorems, 109 equations, 4 figures, 3 algorithms)


Introduction
Preliminaries
Problem Formulation
Related Work
Technical Background
Q-learning.
Method and Theoretical Results
Proposed Method: SUBSAMPLE-Q
Theoretical Guarantee
Proof Outline
Step 2: Bounding Total Variation (TV) Distance.
Experiments
Conclusion, Limitations, and Future Work
Acknowledgements
Mathematical Background and Additional Remarks
...and 4 more sections

Key Result

Lemma 3.3

For all $k\in [n]$ and $m\in\mathbb{N}$, where $m$ is the number of samples in the empirical adapted Bellman equation, there exists a Bellman noise $\epsilon_{k,m}$ such that $\|\hat{\mathcal{T}}_{k,m}\hat{Q}_{k,m}^{\mathrm{est}} - \hat{\mathcal{T}}_k\hat{Q}_k^*\|_\infty = \|\hat{Q}_{k,m}^{\mathrm{est}} \dots$

Figures (4)

Figure 1: Demand-response simulation. (a) Computation time to learn $\hat{\pi}_{k,m}^\mathrm{est}$ for $k\leq n=8$. (b) Reward optimality gap (log scale) with ${\pi}_{k,m}^\mathrm{est}$ running $300$ iterations for $k\leq n=8$. (c) Discounted cumulative rewards for $k\leq n=50$. We note that $k=n$ recovers the mean-field RL iteration solution.
Figure 2: Reward optimality gap (log scale) with ${\pi}_{k,m}^\mathrm{est}$ running $300$ iterations.
Figure 3: Flow of the algorithm and relevant analyses in learning $Q^*$. Here, (1) follows by running the SUBSAMPLE-Q: Learning algorithm on $\hat{Q}_{k,m}^0$. (2) follows from the assumption on the error between $\hat{Q}_{k,m}^{\mathrm{est}}$ and $\hat{Q}_k^*$. (3) follows from the Lipschitz continuity and total variation distance bounds. Finally, (4) follows from noting that $\hat{Q}_n^* = Q^*$.
Figure 4: Star graph $S_n$
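
The flow described in the Figure 3 caption can be read as a triangle-inequality decomposition. The following is a hedged sketch only: it glosses over the fact that $\hat{Q}_{k,m}^{\mathrm{est}}$ and $Q^*$ are defined on different state representations, and it assumes both errors are measured in a common sup norm.

$$
\|\hat{Q}_{k,m}^{\mathrm{est}} - Q^*\|_\infty \;\leq\; \underbrace{\|\hat{Q}_{k,m}^{\mathrm{est}} - \hat{Q}_k^*\|_\infty}_{\text{step (2): Bellman noise } \epsilon_{k,m}} \;+\; \underbrace{\|\hat{Q}_k^* - \hat{Q}_n^*\|_\infty}_{\text{step (3): subsampling error}}, \qquad \hat{Q}_n^* = Q^* \ \text{(step (4))}.
$$

Read this way, the first term is the estimation error controlled by the number of samples $m$, and the second is the cost of using $k$ rather than $n$ agents, which is consistent with the $\tilde{O}(1/\sqrt{k}+\epsilon_{k,m})$ rate stated in the abstract.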

Theorems & Definitions (79)

Definition 2.1: $\epsilon$-optimal policy
Remark 2.2
Definition 3.1: Empirical Distribution Function
Remark 3.2
Lemma 3.3: Theorem 1 of 9570295
Theorem 3.4
Corollary 3.5
Theorem 4.1: Lipschitz continuity in $\hat{Q}_k^*$
Theorem 4.2
Theorem 4.3
...and 69 more
