A Reinforcement Learning Method for Environments with Stochastic Variables: Post-Decision Proximal Policy Optimization with Dual Critic Networks
Leonardo Kanashiro Felizardo, Edoardo Fadda, Paolo Brandimarte, Emilio Del-Moral-Hernandez, Mariá Cristina Vasconcelos Nascimento
TL;DR
The paper addresses reinforcement learning in environments with stochastic state transitions, where Proximal Policy Optimization (PPO) struggles with exploration and stability. It introduces Post-Decision Proximal Policy Optimization (PDPPO), which partitions transitions into a deterministic post-decision step and a stochastic step, and employs dual critics (one for the current state and one for the post-decision state) to improve value-function estimation. The key innovations are the use of post-decision states $s^x$ and the post-decision value function $V^{\pi, x}(s^x)$, combined with a max-advantage update $A^{\pi}_t(s_t)=\max(A^{\pi,x}_t(s_t),A^{\pi,pre}_t(s_t))$ and separate optimizers for each critic. Empirical results in the Frozen Lake and Stochastic Discrete Lot-sizing environments show PDPPO outperforming vanilla PPO, including faster convergence, higher rewards, and reduced sensitivity to initialization, with PDPPO and its dual-critic variant offering robust performance in high-dimensional, stochastic tasks. These findings suggest that incorporating post-decision information and dual critics can significantly enhance learning efficiency and reliability in complex real-world decision problems.
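The max-advantage update can be illustrated with a minimal Python sketch. It assumes each advantage is a simple return-minus-baseline estimate, which this summary does not specify; the function name `max_advantage` and its arguments are hypothetical and not taken from the paper's code.

```python
def max_advantage(returns, v_pre, v_post):
    """Combine two critics' advantage estimates by an elementwise maximum.

    returns: discounted returns G_t, one per time step
    v_pre:   state-critic values V(s_t)
    v_post:  post-decision-critic values V^x(s^x_t)
    Assumption: each advantage is G_t minus the corresponding critic value.
    """
    adv_pre = [g - v for g, v in zip(returns, v_pre)]
    adv_post = [g - v for g, v in zip(returns, v_post)]
    # A_t = max(A^x_t, A^pre_t), as in the update rule quoted above
    return [max(a_x, a_pre) for a_x, a_pre in zip(adv_post, adv_pre)]


advantages = max_advantage(
    returns=[1.0, 2.0, 3.0],
    v_pre=[0.5, 2.5, 1.0],
    v_post=[0.0, 2.0, 2.0],
)
```

At each time step the policy update would then use whichever critic yields the larger advantage.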
Abstract
This paper presents Post-Decision Proximal Policy Optimization (PDPPO), a novel variation of the leading deep reinforcement learning method, Proximal Policy Optimization (PPO). PDPPO divides the state transition into two steps: a deterministic step that yields the post-decision state and a stochastic step that leads to the next state. Our approach incorporates post-decision states and dual critics to reduce the problem's dimensionality and enhance the accuracy of value function estimation. We illustrate such dynamics with lot-sizing, a mixed-integer programming problem whose objective is to optimize production, delivery fulfillment, and inventory levels under uncertain demand and cost parameters. This paper evaluates the performance of PDPPO across various environments and configurations. Notably, PDPPO with a dual critic architecture achieves nearly double the maximum reward of vanilla PPO in specific scenarios, while requiring fewer episode iterations and learning faster and more consistently across different initializations. On average, PDPPO outperforms PPO in environments with a stochastic component in the state transition. These results support the benefits of using a post-decision state: integrating it into the value function approximation leads to more informed and efficient learning in high-dimensional and stochastic environments.
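The two-step transition can be sketched on a toy single-item inventory problem. Everything here is an illustrative assumption rather than the paper's environment: the function names, the lost-sales rule, and the uniform integer demand are all hypothetical.

```python
import random


def post_decision_state(inventory, production):
    """Deterministic step: the effect of the action is fully known."""
    return inventory + production


def next_state(post_inventory, demand):
    """Stochastic step: demand is realised after the decision.

    Assumption: unmet demand is lost, so inventory never goes negative.
    """
    return max(post_inventory - demand, 0)


def step(inventory, production, rng):
    """Return (post-decision state, next state) for one transition."""
    s_x = post_decision_state(inventory, production)
    demand = rng.randint(0, 5)  # assumed demand distribution
    return s_x, next_state(s_x, demand)


rng = random.Random(0)
s_x, s_next = step(inventory=3, production=2, rng=rng)
```

In this sketch the post-decision state `s_x` depends only on the current state and the action, so a critic evaluated at `s_x` does not have to model the action's immediate effect, only the randomness that follows it.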
