Deep reinforcement learning for weakly coupled MDP's with continuous actions
Francisco Robledo, Urtzi Ayesta, Konstantin Avrachenkov
TL;DR
This work tackles constrained decision-making in weakly coupled MDPs with continuous actions by introducing the Lagrange Policy for Continuous Actions (LPCA). LPCA applies a Lagrangian relaxation that decouples project-level decisions, approximates the per-project values $Q_i(s_i,a_i,\lambda)$ with a neural network, and solves a one-dimensional convex optimization to obtain the optimal multiplier $\lambda^*$ under budget $B$. Action selection then solves a knapsack-like problem using either Differential Evolution (LPCA-DE) or a greedy gradient-based strategy (LPCA-Greedy), balancing immediate rewards against costs. Empirical results show that the LPCA variants consistently outperform DDPG with OptLayer and align closely with Whittle-index policies, particularly as the number of projects and the resource constraints increase, highlighting robustness and scalability for continuous-action, resource-constrained MDPs.
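To make the one-dimensional search for $\lambda^*$ concrete, here is a minimal sketch under strong simplifying assumptions. It is not the paper's implementation: the neural-network $Q_i(s_i,a_i,\lambda)$ is replaced by a hypothetical one-period closed form $w_i\sqrt{a} - \lambda a$ with actions in $[0,1]$ and cost equal to the action, and the names `best_response`, `dual`, and `golden_section_min` are illustrative only. The relaxed objective $\lambda B + \sum_i \max_a Q_i(a,\lambda)$ is convex in $\lambda$, so a golden-section search is enough.

```python
import math

# Hypothetical one-period stand-in for the learned Q_i(s_i, a_i, lambda):
#   Q_i(a, lam) = w_i * sqrt(a) - lam * a,  with a in [0, 1] and cost(a) = a.

def best_response(w, lam):
    """Action in [0, 1] maximizing w*sqrt(a) - lam*a (closed form)."""
    if lam <= 0.0:
        return 1.0
    return min(1.0, (w / (2.0 * lam)) ** 2)

def dual(lam, weights, budget):
    """Relaxed objective lam*B + sum_i max_a Q_i(a, lam); convex in lam."""
    total = lam * budget
    for w in weights:
        a = best_response(w, lam)
        total += w * math.sqrt(a) - lam * a
    return total

def golden_section_min(f, lo, hi, tol=1e-9):
    """Minimize a one-dimensional convex function f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) < f(d):
            hi = d  # the minimizer lies in [lo, d]
        else:
            lo = c  # the minimizer lies in [c, hi]
    return 0.5 * (lo + hi)

weights = [1.0, 2.0, 3.0]  # assumed project weights
budget = 1.0               # assumed budget B
lam_star = golden_section_min(lambda lam: dual(lam, weights, budget), 0.0, 10.0)
actions = [best_response(w, lam_star) for w in weights]
print(lam_star, actions, sum(actions))
```

In this toy instance the decoupled best responses at the minimizing multiplier use up the budget, which is the role $\lambda^*$ plays in the relaxation: it prices the shared resource so that the per-project problems can be solved independently.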
Abstract
This paper introduces the Lagrange Policy for Continuous Actions (LPCA), a reinforcement learning algorithm specifically designed for weakly coupled MDP problems with continuous action spaces. LPCA addresses the challenge of resource constraints that depend on continuous actions by introducing a Lagrangian relaxation of the weakly coupled MDP problem within a neural network framework for Q-value computation. This approach effectively decouples the MDP, enabling efficient policy learning in resource-constrained environments. We present two variations of LPCA: LPCA-DE, which uses differential evolution for global optimization, and LPCA-Greedy, a method that incrementally and greedily selects actions based on Q-value gradients. Comparative analysis against other state-of-the-art techniques across various settings highlights LPCA's robustness and efficiency in managing resource allocation while maximizing rewards.
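The incremental, greedy selection idea can be illustrated with a short sketch. This is an assumption-laden toy rather than the LPCA-Greedy procedure itself: the learned Q-values are replaced by hypothetical concave functions $w_i\sqrt{a}$, the cost of an action is assumed to equal its magnitude, and a finite-difference marginal gain stands in for the Q-value gradient. The function name `greedy_allocate` and its parameters are illustrative.

```python
import math

def greedy_allocate(q_funcs, budget, step=0.01, a_max=1.0):
    """Spend the budget in increments of `step`, each time on the project
    whose Q-value rises most for that increment (finite-difference slope)."""
    units = [0] * len(q_funcs)            # increments assigned to each project
    max_units = int(round(a_max / step))  # per-project action cap, in increments
    for _ in range(int(budget / step + 1e-9)):
        best_i, best_gain = None, 0.0
        for i, q in enumerate(q_funcs):
            if units[i] >= max_units:
                continue
            gain = q(step * (units[i] + 1)) - q(step * units[i])
            if gain > best_gain:
                best_i, best_gain = i, gain
        if best_i is None:  # no project benefits from more resource
            break
        units[best_i] += 1
    return [step * u for u in units]

# Hypothetical stand-ins for the learned per-project Q-values at a fixed state.
q_funcs = [lambda a, w=w: w * math.sqrt(a) for w in (1.0, 2.0, 3.0)]
alloc = greedy_allocate(q_funcs, budget=1.0)
print(alloc, sum(alloc))
```

With concave per-project values, diminishing marginal gains make this increment-by-increment rule a natural fit. For non-concave learned Q-values a greedy pass may stall in a local optimum, which is one reason a global optimizer such as differential evolution is a useful alternative.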
