Minimax Optimal Strategy for Delayed Observations in Online Reinforcement Learning

Harin Lee; Kevin Jamieson

TL;DR

The paper proposes an algorithm for tabular Markov decision processes (MDPs) with delayed state observations that combines the augmentation method with the upper confidence bound approach, and proves a lower bound that matches its regret up to logarithmic factors, showing the optimality of the approach.

Abstract

We study reinforcement learning with delayed state observation, where the agent observes the current state only after a random number of time steps. We propose an algorithm that combines the augmentation method and the upper confidence bound approach. For tabular Markov decision processes (MDPs), we derive a regret bound of $\tilde{\mathcal{O}}(H \sqrt{D_{\max} SAK})$, where $S$ and $A$ are the cardinalities of the state and action spaces, $H$ is the time horizon, $K$ is the number of episodes, and $D_{\max}$ is the maximum length of the delay. We also provide a matching lower bound up to logarithmic factors, showing the optimality of our approach. Our analytical framework formulates this problem as a special case of a broader class of MDPs whose transition dynamics decompose into a known component and an unknown but structured component. We establish general results for this abstract setting, which may be of independent interest.

Paper Structure (47 sections, 36 theorems, 135 equations, 3 figures, 2 tables, 1 algorithm)

Introduction
Related Work
RL under Tabular MDPs.
MDPs with Delayed Observations.
Delayed feedback in bandits and RL.
Preliminaries
Notations
Markov Decision Process
Problem Setting
Constant Delayed MDP (CDMDP).
Algorithm
Augmented Markov Decision Process
Algorithm
Theoretical Guarantees
Discussion of the known-delay and unknown-delay theorems.
...and 32 more sections

Key Result

Theorem 1

Suppose the delay distribution ${P_{\mathrm{delay}}}$ is known. With probability at least $1 - \delta$, Algorithm 1 achieves a regret bound of $\mathcal{O}(H\sqrt{(D_{\mathrm{max}} \land B) SAK} \iota + H B S A \iota^2)$.
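
To get a feel for how the two terms in the Theorem 1 bound scale, the following minimal sketch evaluates them numerically. It is only an illustration: the constant hidden by $\mathcal{O}(\cdot)$ is set to 1, and since $B$ and $\iota$ are not defined in this summary, they are treated as plain numeric inputs. The function name `regret_bound_terms` is hypothetical and not from the paper.

```python
import math


def regret_bound_terms(H, S, A, K, D_max, B, iota):
    """Evaluate both terms of the Theorem 1 bound with the hidden constant set to 1.

    B and iota are passed in as plain numbers because this summary does not
    define them (assumption made for illustration only).
    """
    # Leading term: H * sqrt((D_max ∧ B) * S * A * K) * iota, where ∧ is the minimum.
    leading = H * math.sqrt(min(D_max, B) * S * A * K) * iota
    # Lower-order term: H * B * S * A * iota^2, which does not depend on K.
    lower_order = H * B * S * A * iota ** 2
    return leading, lower_order


if __name__ == "__main__":
    # Compare the two terms as the number of episodes K grows.
    for K in (10**2, 10**4, 10**6):
        lead, low = regret_bound_terms(H=10, S=5, A=3, K=K, D_max=4, B=8, iota=1.0)
        print(f"K={K}: leading={lead:.1f}, lower_order={low:.1f}")
```

Because only the first term depends on $K$, and it grows as $\sqrt{K}$, it dominates the bound once the number of episodes is large enough.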

Figures (3)

Figure 1: Illustration of the transition of the augmented MDP after taking one action. Gray states indicate intermediate states that have no actions. Straight lines represent the agent's action and dotted lines represent augmented state transitions. $p_{t_h,\widetilde{\Delta}}$ is shorthand for $P_{\mathtt{tran}}(s_{t_h}, a_{t_h}, \widetilde{\Delta})$, and $p_{t_h+1, -1}$ is shorthand for $P_{\mathtt{tran}}(s_{t_h+1}, a_{t_h+1}, -1)$.
Figure 2: Illustration of core properties of the augmented MDP's state transition. Consider a transition from the augmented state-action pair $((s_{t_h}; a_{t_h}, \ldots, a_{h-1}), a_h)$. The orange-shaded part indicates that the transition dynamics of the action queue are known: the queue is simply shifted from the previous action queue. The blue-shaded part indicates that the unknown part of the state transition is determined only by $(s_{t_h}, a_{t_h})$ and is independent of the rest of the augmented state-action pair.
Figure 3: Illustration of hard instances for Theorem 3 (the lower bound result). The structure consists of a tree structure (top) and a CodeMDP (bottom). The leaf states are labeled $l_1, l_2, l_3, l_4$. Each leaf state and action pair has its own probability distribution over the states in the CodeMDP. Once the agent takes an action from a leaf state, it must take $\widetilde{D}$ actions without observing which state it has landed in. The agent enters the success state $s_{\mathrm{succ}}$ and receives the reward if it landed at state $(i, b)$ and the $i$-th of the $\widetilde{D}$ actions is $b$. It enters the fail state $s_{\mathrm{fail}}$ and receives no reward if it landed at state $(i, b)$ and the $i$-th of the $\widetilde{D}$ actions is not $b$.
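
The structure described in the captions of Figures 2 and 3 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the paper's construction: the augmented state is modeled as a pair (last observed state, queue of actions taken since), at most one pending state is revealed per step, and the names `step_augmented`, `sample_next`, and `code_mdp_outcome` are hypothetical.

```python
from typing import Callable, Hashable, Sequence, Tuple

State = Hashable
Action = Hashable
AugState = Tuple[State, Tuple[Action, ...]]


def step_augmented(
    aug_state: AugState,
    action: Action,
    observed: bool,
    sample_next: Callable[[State, Action], State],
) -> AugState:
    """One step of a simplified augmented MDP (illustrative assumption).

    aug_state = (last observed state, queue of actions taken since then).
    `observed` says whether one delayed observation arrives at this step.
    """
    state, queue = aug_state
    # Known part: the new action is appended to the action queue.
    queue = queue + (action,)
    if observed:
        # Unknown part: depends only on the last observed state and the
        # oldest queued action, not on the rest of the queue.
        state = sample_next(state, queue[0])
        queue = queue[1:]  # shift the queue past the consumed action
    return state, queue


def code_mdp_outcome(landed: Tuple[int, Action], actions: Sequence[Action]) -> str:
    """Figure 3 rule: success iff the i-th blind action equals b (i is 1-indexed)."""
    i, b = landed
    return "succ" if actions[i - 1] == b else "fail"
```

For example, with a deterministic stand-in `sample_next = lambda s, a: s + a`, calling `step_augmented((0, (1,)), 2, True, sample_next)` consumes the oldest queued action and keeps the newer one in the queue. The CodeMDP rule shows why the hard instance is hard: the agent must commit to all $\widetilde{D}$ actions before learning which pair $(i, b)$ it landed on.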

Theorems & Definitions (40)

Remark 1
Theorem 1
Theorem 2
Remark 2
Theorem 3: Lower bound result
Proposition 1
Theorem 4: Restatement of Theorem 6 in Burago et al. (1996)
Definition 1: MDPs with partially known dynamics
Remark 3
Theorem 5
...and 30 more
