Tree Search-Based Policy Optimization under Stochastic Execution Delay

David Valensi, Esther Derman, Shie Mannor, Gal Dalal

TL;DR

The paper tackles reinforcement learning under stochastic execution delays by formulating stochastic execution-delay MDPs (SED-MDPs) that avoid state augmentation. It proves that, when delay realizations are observable, policy optimization can be restricted to Markov policies, enabling tractable learning. The authors introduce DEZ, a model-based method built on EfficientZero that uses a forward model and Monte Carlo tree search to infer future states under delays and optimize decisions via a modified policy objective. Empirical results on Atari show that DEZ outperforms the baselines under both constant and stochastic delays, which indicates that it retains sample efficiency and is robust to delay randomness. Code and proofs are provided to support reproducibility.

Abstract

The standard formulation of Markov decision processes (MDPs) assumes that the agent's decisions are executed immediately. However, in numerous realistic applications such as robotics or healthcare, actions are performed with a delay whose value can even be stochastic. In this work, we introduce stochastic delayed execution MDPs, a new formalism addressing random delays without resorting to state augmentation. We show that given observed delay values, it is sufficient to perform a policy search in the class of Markov policies in order to reach optimal performance, thus extending the deterministic fixed delay case. Armed with this insight, we devise DEZ, a model-based algorithm that optimizes over the class of Markov policies. DEZ leverages Monte-Carlo tree search, similar to its non-delayed variant EfficientZero, to accurately infer future states from the action queue. Thus, it handles delayed execution while preserving the sample efficiency of EfficientZero. Through a series of experiments on the Atari suite, we demonstrate that although the previous baseline outperforms the naive method in scenarios with constant delay, it underperforms in the face of stochastic delays. In contrast, our approach significantly outperforms the baselines, for both constant and stochastic delays. The code is available at http://github.com/davidva1/Delayed-EZ.
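
To make the idea of inferring a future state from the action queue concrete, here is a minimal sketch. It is not the DEZ implementation: the function `infer_future_state`, the stand-in `toy_model`, and the integer state are hypothetical names and simplifications introduced for illustration only. The sketch assumes a deterministic one-step forward model and a queue of actions that have been decided but not yet executed.

```python
from collections import deque
from typing import Callable, Iterable, TypeVar

S = TypeVar("S")  # state type (hypothetical; a learned latent state in practice)
A = TypeVar("A")  # action type


def infer_future_state(
    state: S,
    pending_actions: Iterable[A],
    forward_model: Callable[[S, A], S],
) -> S:
    """Roll a one-step model through the pending actions, oldest first.

    The result is a prediction of the state in which a newly chosen
    action would take effect, assuming every pending action is executed
    in order before it.
    """
    predicted = state
    for action in pending_actions:
        predicted = forward_model(predicted, action)
    return predicted


def toy_model(state: int, action: int) -> int:
    """Toy stand-in for a learned model: the action shifts an integer position."""
    return state + action


# Three actions were decided earlier and are still waiting to be executed.
queue = deque([1, 1, -1])
predicted_state = infer_future_state(0, queue, toy_model)
```

A policy would then be evaluated on `predicted_state` rather than on the currently observed state. With an empty queue the function returns the observed state unchanged, which corresponds to the non-delayed case.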

Paper Structure

This paper contains 22 sections, 4 theorems, 27 equations, 7 figures, 2 tables, and 1 algorithm.

Key Result

Theorem 4.1

For any policy $\pi := (\pi_t)_{t\in\mathbb{N}} \in \Pi^\textsc{HR}$, the probability of observing history $h_t := (s_0,z_0,a_0,\cdots, a_{t-1},s_t, z_t)$ is given by:

Figures (7)

  • Figure 1: Pending queue resolution in a SED-MDP. The policy input $\hat{s}_{t+z_t}$ corresponds to the state inferred at $t$ by a forward model (see the section on DEZ). For clarity, effective decision times are shown for $t\in\{5,6\}$ only.
  • Figure 2: Interaction diagram between DEZ and the delayed environment
  • Figure 3: Average score on 15 Atari games and delays $M\in\{5,15,25\}$ over 32 test episodes per trained seed. Delays appear from low to high values for each game. Left: Constant delay value; Right: Stochastic delay value within $\{0,\cdots, M\}$.
  • Figure 4: Random walk behavior of the stochastic delay across multiple episodes. No initial delay value is set at the start of the episode. Here, the maximal delay is 15.
  • Figure 5: Convergence plots for 15 Atari games on constant delays in $\{5,15,25\}$.
  • ...and 2 more figures
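
The effective decision times mentioned in the Figure 1 caption can be illustrated with a small sketch. The rule used here is an assumption, not taken from this page: the action executed at step $t$ is taken to be the most recently decided one whose delay has already elapsed, that is, the largest $t' \le t$ with $t' + z_{t'} \le t$. The function name `effective_decision_time` and the example delays are hypothetical.

```python
from typing import Optional, Sequence


def effective_decision_time(t: int, delays: Sequence[int]) -> Optional[int]:
    """Return the latest decision step t' <= t with t' + delays[t'] <= t.

    delays[t'] is the delay observed for the action decided at step t'.
    Returns None when no decided action has become executable yet.
    Assumed rule: the most recent decision whose delay has elapsed wins.
    """
    latest = min(t, len(delays) - 1)
    for t_prime in range(latest, -1, -1):
        if t_prime + delays[t_prime] <= t:
            return t_prime
    return None


# Hypothetical delay realizations for decisions made at steps 0..4.
example_delays = [2, 0, 3, 1, 0]
times = [effective_decision_time(t, example_delays) for t in range(5)]
```

In this example the action decided at step 0 becomes executable at step 2, but the one decided at step 1 arrived earlier and was decided later, so under the assumed rule it is the one in effect at steps 1 through 3.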

Theorems & Definitions (7)

  • Theorem 4.1
  • Theorem 4.2
  • proof
  • Lemma A.1
  • proof
  • Theorem
  • proof