Control in Stochastic Environment with Delays: A Model-based Reinforcement Learning Approach

Zhiyuan Yao; Ionut Florescu; Chihoon Lee

TL;DR

This paper develops Stochastic Model Based Simulation (SMBS) to control systems with delayed feedback and stochastic transitions by sampling multiple possible target states from a probabilistic environment model. The action policy combines mean Q-values with a risk penalty, $\bar{Q}_M(a) - \alpha \hat{Q}_M(a)$, enabling risk-aware planning in delay-prone settings. SMBS demonstrates robustness and often superior performance compared with AMDP and Delayed-Q across classic control tasks and Atari environments, and its risk parameter $\alpha$ provides tunable conservatism under uncertainty. Theoretical results establish equivalence to AMDP in deterministic cases and provide probabilistic error bounds as the number of samples grows, supporting practical applicability in real-world delayed control scenarios.
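The scoring rule $\bar{Q}_M(a) - \alpha \hat{Q}_M(a)$ can be illustrated with a minimal sketch. This summary does not define $\hat{Q}_M(a)$; the code assumes it is the standard deviation of the Q-values over the $M$ sampled target states. The function name `smbs_action` and the toy Q-values are hypothetical and are not taken from the paper.

```python
import numpy as np


def smbs_action(q_values, alpha):
    """Score actions from Q-values evaluated at M sampled target states.

    q_values: array of shape (M, num_actions); row m holds Q(s_m, a) for the
              m-th target state sampled from an environment model.
    alpha:    risk parameter; larger values penalize dispersion more heavily.
    """
    q_values = np.asarray(q_values, dtype=float)
    q_mean = q_values.mean(axis=0)  # \bar{Q}_M(a): mean over samples
    q_risk = q_values.std(axis=0)   # ASSUMED form of \hat{Q}_M(a): std over samples
    scores = q_mean - alpha * q_risk
    return int(np.argmax(scores)), scores


# Toy example: action 0 is safe (always 1.0), action 1 is risky (0.0 or 4.0).
q_samples = np.array([
    [1.0, 0.0],
    [1.0, 4.0],
    [1.0, 0.0],
    [1.0, 4.0],
])

risk_neutral, neutral_scores = smbs_action(q_samples, alpha=0.0)
risk_averse, averse_scores = smbs_action(q_samples, alpha=1.0)
```

With `alpha=0.0` the rule reduces to picking the highest mean, so the risky action wins; with `alpha=1.0` the dispersion penalty makes the safe action preferable. This is the sense in which $\alpha$ acts as a conservatism knob.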

Abstract

We introduce a new reinforcement learning method for control problems in environments with delayed feedback. Specifically, our method employs stochastic planning, whereas previous methods used deterministic planning. This allows us to embed risk preference in the policy optimization problem. We show that this formulation recovers the optimal policy for problems with deterministic transitions. We contrast our policy with two prior methods from the literature. We apply the methodology to simple tasks to understand its features, and then compare the performance of the methods in controlling multiple Atari games.

Paper Structure (13 sections, 2 theorems, 21 equations, 9 figures, 2 algorithms)

Introduction
Preliminaries
Stochastic Model Based Simulation (SMBS)
Experiments
Tasks
Training/Evaluation Procedure
Results
Atari Learning Environments
Risk Parameter $\alpha$
Conclusion
Appendix
Proof of Theorem 1
Proof of Theorem 2

Key Result

Theorem 1

Assume a discrete-time MDP with an infinite time horizon whose Markovian transitions are deterministic, i.e., for arbitrary $(s, a)\in \mathcal{S}\times \mathcal{A}$ there exists an $s'\in\mathcal{S}$ such that $P(S_{t+1} = s'\mid S_t = s, A_t = a) = 1$ for all $t=0,1,\ldots$. Then, the policy … where $\tilde{q}^*$ denotes the optimal Q-function for the AMDP.
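A short worked step shows why deterministic transitions make sampling harmless. It is a sketch under assumptions this summary does not confirm: that the environment model reproduces the deterministic transitions exactly, that $\hat{Q}_M(a)$ is a dispersion measure such as the sample standard deviation, and that $Q$ denotes whatever Q-function is used for scoring. The symbols $s^{(m)}$ and $s^\dagger$ are introduced here for illustration only.

```latex
% All M sampled target states coincide when transitions are deterministic.
s^{(1)} = \cdots = s^{(M)} = s^\dagger
\;\Longrightarrow\;
\bar{Q}_M(a) = \frac{1}{M}\sum_{m=1}^{M} Q\bigl(s^{(m)}, a\bigr) = Q(s^\dagger, a),
\qquad
\hat{Q}_M(a) = 0,
\qquad\text{so}\qquad
\bar{Q}_M(a) - \alpha\,\hat{Q}_M(a) = Q(s^\dagger, a)
\quad\text{for every } \alpha .
```

Under these assumptions the risk parameter has no effect, and action selection depends only on the single reachable target state.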

Figures (9)

Figure 1: Illustration of control in real-time applications.
Figure 2: The stochastic environment evolution for 5 delay steps.
Figure 3: An illustration of the policy function of the SMBS method.
Figure 4: Illustrations of tasks used for comparison.
Figure 5: Illustrations of tasks used for comparison.
...and 4 more figures

Theorems & Definitions (4)

Theorem 1
Theorem 2
proof
proof
