Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

Harin Lee, Kevin Jamieson

TL;DR

We propose an algorithm for tabular Markov decision processes (MDPs) that combines the augmentation method with the upper confidence bound approach, and we prove a lower bound that matches its regret up to logarithmic factors, showing the optimality of the approach.

Abstract

We study reinforcement learning with delayed state observation, where the agent observes the current state only after a random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$, where $S$ and $A$ are the cardinalities of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs whose transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this abstract setting, which may be of independent interest.

Paper Structure (47 sections, 36 theorems, 135 equations, 3 figures, 2 tables, 1 algorithm)

Key Result

Theorem 1

Suppose the delay distribution $P_{\mathrm{delay}}$ is known. With probability at least $1 - \delta$, the proposed algorithm (alg:MVP delay informal) achieves a regret bound of $\mathcal{O}(H\sqrt{(D_{\max} \land B) SAK}\, \iota + H B S A \iota^2)$.
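
To get a feel for how the two terms of this bound compare, the following minimal sketch evaluates them numerically. It is an illustration only: the function name `regret_bound_terms` is hypothetical, the constant hidden by $\mathcal{O}(\cdot)$ is set to 1, $\land$ is read as a minimum, and $B$ and $\iota$ are treated as plain numeric inputs because this excerpt does not define them.

```python
import math


def regret_bound_terms(H, D_max, B, S, A, K, iota):
    """Return (leading, lower_order) terms of the Theorem 1 bound.

    Assumptions (not from the source): the hidden constant is 1, and
    B and iota are supplied by the caller as plain numbers.
    """
    # Leading term: H * sqrt((D_max ∧ B) * S * A * K) * iota
    leading = H * math.sqrt(min(D_max, B) * S * A * K) * iota
    # Lower-order term: H * B * S * A * iota^2, independent of K
    lower_order = H * B * S * A * iota ** 2
    return leading, lower_order


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for K in (10**3, 10**5, 10**7):
        lead, low = regret_bound_terms(H=10, D_max=4, B=20, S=5, A=2, K=K, iota=1.0)
        print(f"K={K}: leading={lead:.1f}, lower_order={low:.1f}")
```

Because the second term does not depend on $K$, the first term dominates once the number of episodes is large enough, which is consistent with the $\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$ rate stated in the abstract.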

Figures (3)

  • Figure 1: Illustration of the transition of the augmented MDP after taking one action. Gray states indicate intermediate states that have no actions. Straight lines represent the agent's action and dotted lines represent augmented state transitions. $p_{t_h,\widetilde{\Delta}}$ is shorthand for $P_{\mathtt{tran}}(s_{t_h}, a_{t_h}, \widetilde{\Delta})$, and $p_{t_h+1, -1}$ is shorthand for $P_{\mathtt{tran}}(s_{t_h+1}, a_{t_h+1}, -1)$.
  • Figure 2: Illustration of core properties of the augmented MDP's state transition. Consider a transition from the augmented state-action pair $((s_{t_h}; a_{t_h}, \ldots, a_{h-1}), a_h)$. The orange-shaded part indicates that the transition dynamics for the action queue are known: the new queue is simply a shift of the previous one. The blue-shaded part indicates that the unknown part of the state transition is determined only by $(s_{t_h}, a_{t_h})$ and is independent of the rest of the augmented state-action pair.
  • Figure 3: Illustration of hard instances for Theorem 3 (the lower bound result). Each instance consists of a tree (top) and a CodeMDP (bottom). The leaf states are labeled $l_1, l_2, l_3, l_4$. Each leaf state and action pair has its own probability distribution over the states in the CodeMDP. Once the agent takes an action from a leaf state, it must take $\widetilde{D}$ actions without observing which state it has landed in. The agent enters the success state $s_{\mathrm{succ}}$ and receives the reward if it landed in state $(i, b)$ and the $i$-th of its $\widetilde{D}$ actions is $b$. It enters the fail state $s_{\mathrm{fail}}$ and receives no reward if it landed in state $(i, b)$ and the $i$-th of its $\widetilde{D}$ actions is not $b$.
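
The structure described in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's construction: the names `step_augmented` and `code_mdp_outcome` and the `revealed_next` argument are hypothetical, at most one delayed state is assumed to be revealed per step, and the random delay mechanism itself is not modeled.

```python
from typing import Hashable, Optional, Sequence, Tuple

State = Hashable
Action = Hashable
# Augmented state: (last observed state, queue of actions taken since then)
AugState = Tuple[State, Tuple[Action, ...]]


def step_augmented(aug: AugState, action: Action,
                   revealed_next: Optional[State] = None) -> AugState:
    """One transition of a simplified augmented state (cf. Figure 2).

    revealed_next is a hypothetical argument: the successor of the last
    observed state if its delayed observation arrives now, else None.
    """
    state, queue = aug
    # Known dynamics: the chosen action is appended to the action queue.
    queue = queue + (action,)
    if revealed_next is None:
        return (state, queue)
    # Unknown dynamics: revealed_next depends only on (state, queue[0]);
    # that oldest action is consumed once its outcome is observed.
    return (revealed_next, queue[1:])


def code_mdp_outcome(landed: Tuple[int, Action],
                     actions: Sequence[Action]) -> str:
    """Success rule of the CodeMDP in Figure 3.

    landed is the hidden state (i, b), with i counted from 1 as in the
    caption; actions are the D-tilde actions taken without observation.
    """
    i, b = landed
    return "succ" if actions[i - 1] == b else "fail"


if __name__ == "__main__":
    aug = ("s0", ("a0", "a1"))
    print(step_augmented(aug, "a2"))        # queue grows, state unchanged
    print(step_augmented(aug, "a2", "s1"))  # state advances, queue shifts
    print(code_mdp_outcome((2, 1), [0, 1, 0]))
```

Keeping the queue update separate from the sampled successor mirrors the decomposition into a known component and an unknown but structured component that the abstract describes.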

Theorems & Definitions (40)

  • Remark 1
  • Theorem 1
  • Theorem 2
  • Remark 2
  • Theorem 3: Lower bound result
  • Proposition 1
  • Theorem 4: Restatement of Theorem 6 in burago1996complexity
  • Definition 1: MDPs with partially known dynamics
  • Remark 3
  • Theorem 5
  • ...and 30 more