Regret-Optimal Q-Learning with Low Cost for Single-Agent and Federated Reinforcement Learning

Haochen Zhang, Zhong Zheng, Lingzhou Xue

TL;DR

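Two novel model-free RL algorithms are proposed -- Q-EarlySettled-LowCost and FedQ-EarlySettled-LowCost -- that are the first in the literature to simultaneously achieve the best near-optimal regret among all known model-free RL or FRL algorithms, a burn-in cost that scales linearly with $S$ and $A$, and a logarithmic policy switching cost (single-agent RL) or communication cost (FRL).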

Abstract

Motivated by real-world settings where data collection and policy deployment -- whether for a single agent or across multiple agents -- are costly, we study the problem of on-policy single-agent reinforcement learning (RL) and federated RL (FRL) with a focus on minimizing burn-in costs (the sample sizes needed to reach near-optimal regret) and policy switching or communication costs. In parallel finite-horizon episodic Markov Decision Processes (MDPs) with $S$ states and $A$ actions, existing methods either require superlinear burn-in costs in $S$ and $A$ or fail to achieve logarithmic switching or communication costs. We propose two novel model-free RL algorithms -- Q-EarlySettled-LowCost and FedQ-EarlySettled-LowCost -- that are the first in the literature to simultaneously achieve: (i) the best near-optimal regret among all known model-free RL or FRL algorithms, (ii) low burn-in cost that scales linearly with $S$ and $A$, and (iii) logarithmic policy switching cost for single-agent RL or communication cost for FRL. Additionally, we establish gap-dependent theoretical guarantees for both regret and switching/communication costs, improving or matching the best-known gap-dependent bounds.
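
To make the notion of policy switching cost concrete, the sketch below counts how often the deployed policy changes between consecutive episodes. It assumes a common global-switching convention (the number of episodes $k$ with $\pi^k \neq \pi^{k+1}$), which may differ from the paper's formal definition; the function name `count_policy_switches` and the tuple encoding of policies are hypothetical and chosen only for illustration.

```python
from typing import Hashable, Sequence


def count_policy_switches(policies: Sequence[Hashable]) -> int:
    """Count episodes k whose deployed policy differs from that of episode k+1.

    Assumption: each policy is encoded as a hashable object (here a tuple of
    actions); this encoding is illustrative and not taken from the paper.
    """
    return sum(1 for prev, nxt in zip(policies, policies[1:]) if prev != nxt)


# Toy example: six episodes, with the policy changing after episodes 2 and 5.
pi_a, pi_b, pi_c = (0, 1), (1, 1), (1, 0)
deployed = [pi_a, pi_a, pi_b, pi_b, pi_b, pi_c]
print(count_policy_switches(deployed))
```

A logarithmic switching cost means this count grows like $\log T$ rather than linearly in the number of episodes, so most consecutive episodes reuse the same policy.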

Paper Structure

This paper contains 34 sections, 34 theorems, 423 equations, 7 figures, 4 tables, 2 algorithms.

Key Result

Theorem 4.1

For any $p \in (0,1)$, let $\iota_0 = \log(SAT/p)$. Then for Q-EarlySettled-LowCost (the algorithms alg_early_server and alg_early_agent with $M=1$ and $\beta \in (0,H]$), with probability at least $1-p$, we have

Figures (7)

  • Figure 1: Numerical comparison of regrets for single-agent model-free algorithms
  • Figure 2: Switching cost results for Q-EarlySettled-LowCost when $M=1$
  • Figure 3: Numerical comparison of regrets for federated model-free algorithms
  • Figure 4: Number of communication rounds for FedQ-EarlySettled-LowCost
  • Figure 5: Central server broadcast protocol. At the beginning of round $k$, for any state-step pair $(s,h) \in \mathcal{S} \times [H]$, the central server broadcasts to each agent the current policy $\pi^k$, the total number of visits before round $k$, $N_h^k(s,\pi_h^k(s))$, the $V$-estimates $V_h^k(s,\pi_h^k(s))$, the lower bound function $V_h^{\textnormal{L},k}(s,\pi_h^k(s))$, and the reference function $V_h^{\textnormal{R},k}(s,\pi_h^k(s))$.
  • ...and 2 more figures
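
The broadcast described in the Figure 5 caption can be pictured as one record per state-step pair. The following is a minimal sketch under stated assumptions, not the paper's implementation: the names `BroadcastEntry` and `build_broadcast` are hypothetical, states, steps and actions are encoded as integers, and the value-type quantities are stored per $(s,h)$ for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

StateStep = Tuple[int, int]             # (s, h)
StateStepAction = Tuple[int, int, int]  # (s, h, a)


@dataclass(frozen=True)
class BroadcastEntry:
    """What the server sends for one state-step pair (s, h) in round k."""
    action: int         # pi_h^k(s), the action chosen by the current policy
    visits: int         # N_h^k(s, pi_h^k(s)), visits before round k
    v_estimate: float   # V-estimate
    v_lower: float      # lower bound function V^L
    v_reference: float  # reference function V^R


def build_broadcast(
    policy: Dict[StateStep, int],
    visit_counts: Dict[StateStepAction, int],
    v_estimate: Dict[StateStep, float],
    v_lower: Dict[StateStep, float],
    v_reference: Dict[StateStep, float],
) -> Dict[StateStep, BroadcastEntry]:
    """Assemble the per-(s, h) message; unvisited triples count as 0 visits."""
    message = {}
    for (s, h), a in policy.items():
        message[(s, h)] = BroadcastEntry(
            action=a,
            visits=visit_counts.get((s, h, a), 0),
            v_estimate=v_estimate[(s, h)],
            v_lower=v_lower[(s, h)],
            v_reference=v_reference[(s, h)],
        )
    return message


# Toy instance (hypothetical numbers): one state, two steps.
policy = {(0, 1): 1, (0, 2): 0}
counts = {(0, 1, 1): 5}
v = {(0, 1): 2.0, (0, 2): 1.0}
v_l = {(0, 1): 0.5, (0, 2): 0.0}
v_r = {(0, 1): 2.0, (0, 2): 1.0}
msg = build_broadcast(policy, counts, v, v_l, v_r)
print(msg[(0, 1)])
```

Only the visit count of the action selected by the current policy is sent for each $(s,h)$, so the message size scales with the number of state-step pairs rather than with all state-action-step triples.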

Theorems & Definitions (62)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 4.4
  • Theorem 4.5
  • Definition 4.6
  • Theorem 4.7
  • ...and 52 more