Efficient Reinforcement Learning for Global Decision Making in the Presence of Local Agents at Scale

Emile Anand, Guannan Qu

TL;DR

This work proposes the SUBSAMPLE-Q algorithm, in which the global agent subsamples $k\leq n$ local agents to compute a policy in time polynomial in $k$, and shows that the learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k}+\epsilon_{k,m})$ as the number of subsampled agents $k$ increases.

Abstract

We study reinforcement learning for global decision-making in the presence of local agents, where the global decision-maker makes decisions affecting all local agents, and the objective is to learn a policy that maximizes the joint rewards of all the agents. Such problems have many applications, e.g., demand response, EV charging, and queueing. In this setting, scalability has been a long-standing challenge because the size of the state space can be exponential in the number of agents. This work proposes the SUBSAMPLE-Q algorithm, in which the global agent subsamples $k\leq n$ local agents to compute a policy in time that is polynomial in $k$. We show that this learned policy converges to the optimal policy on the order of $\tilde{O}(1/\sqrt{k}+\epsilon_{k,m})$ as the number of subsampled agents $k$ increases, where $\epsilon_{k,m}$ is the Bellman noise. Finally, we validate the theory through numerical simulations in a demand-response setting and a queueing setting.
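To make the scalability argument concrete, here is a minimal Python sketch of the subsampling idea. It is not the authors' implementation: the function names, the three-element local state space, and the choice to summarize the $k$ sampled agents by their empirical distribution are illustrative assumptions. It only compares the number of joint local states with the number of distinct empirical distributions over $k$ sampled agents.

```python
import math
import random
from collections import Counter

# Hypothetical helper names; none of these come from the paper.

def subsample_agents(local_states, k, rng):
    """Pick k of the n local agents' states uniformly at random, without replacement."""
    return rng.sample(local_states, k)

def empirical_distribution(sampled_states, local_state_space):
    """Fraction of the sampled agents found in each local state."""
    counts = Counter(sampled_states)
    k = len(sampled_states)
    return tuple(counts[s] / k for s in local_state_space)

def joint_state_count(num_local_states, n):
    """Number of joint configurations of n local agents (exponential in n)."""
    return num_local_states ** n

def empirical_distribution_count(num_local_states, k):
    """Number of distinct empirical distributions of k samples (polynomial in k)."""
    return math.comb(k + num_local_states - 1, num_local_states - 1)

rng = random.Random(0)
local_state_space = (0, 1, 2)  # assumed toy local state space
n, k = 50, 5
local_states = [rng.choice(local_state_space) for _ in range(n)]

sample = subsample_agents(local_states, k, rng)
dist = empirical_distribution(sample, local_state_space)

print("joint local states:", joint_state_count(len(local_state_space), n))
print("empirical distributions of k agents:",
      empirical_distribution_count(len(local_state_space), k))
```

For a fixed local state space, the second count grows polynomially in $k$, while the first grows exponentially in $n$.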

Paper Structure (19 sections, 35 theorems, 109 equations, 4 figures, 3 algorithms)

Key Result

Lemma 3.3

For all $k\in [n]$ and $m\in\mathbb{N}$, where $m$ is the number of samples used in the empirical adapted Bellman operator $\hat{\mathcal{T}}_{k,m}$, there exists a Bellman noise $\epsilon_{k,m}$ such that $\|\hat{\mathcal{T}}_{k,m}\hat{Q}_{k,m}^{\mathrm{est}} - \hat{\mathcal{T}}_k\hat{Q}_k^*\|_\infty = \|\hat{Q}_{k,m}^{\mathrm{est}} - \hat{Q}_k^*\|_\infty \leq \epsilon_{k,m}$.

Figures (4)

  • Figure 1: Demand-response simulation. a) Computation time to learn $\hat{\pi}_{k,m}^\mathrm{est}$ for $k\leq n=8$. b) Reward optimality gap (log scale) with ${\pi}_{k,m}^\mathrm{est}$ running $300$ iterations for $k\leq n=8$. c) Discounted cumulative rewards for $k\leq n=50$. We note that $k=n$ recovers the mean-field RL iteration solution.
  • Figure 2: Reward optimality gap (log scale) with ${\pi}_{k,m}^\mathrm{est}$ running $300$ iterations.
  • Figure 3: Flow of the algorithm and relevant analyses in learning $Q^*$. Here, (1) follows by running the SUBSAMPLE-Q: Learning algorithm on $\hat{Q}_{k,m}^0$. (2) follows from the assumption on the error between $\hat{Q}_{k,m}^\mathrm{est}$ and $\hat{Q}_k^*$. (3) follows from the Lipschitz continuity and total variation distance bounds. Finally, (4) follows from noting that $\hat{Q}_n^* = Q^*$.
  • Figure 4: Star graph $S_n$

Theorems & Definitions (79)

  • Definition 2.1: $\epsilon$-optimal policy
  • Remark 2.2
  • Definition 3.1: Empirical Distribution Function
  • Remark 3.2
  • Lemma 3.3: Theorem 1 of 9570295
  • Theorem 3.4
  • Corollary 3.5
  • Theorem 4.1: Lipschitz continuity in $\hat{Q}_k^*$
  • Theorem 4.2
  • Theorem 4.3
  • ...and 69 more