Deep reinforcement learning for weakly coupled MDP's with continuous actions

Francisco Robledo, Urtzi Ayesta, Konstantin Avrachenkov

TL;DR

This work tackles constrained decision-making in weakly coupled MDPs with continuous actions by introducing the Lagrange Policy for Continuous Actions (LPCA). LPCA employs a Lagrangian relaxation that decouples project-level decisions, combined with a neural-network approximation of per-project $Q_i(s_i,a_i,\lambda)$ values, and a one-dimensional convex optimization to obtain the optimal $\lambda^*$ under budget $B$. Action selection then solves a knapsack-like problem using either Differential Evolution (LPCA-DE) or Greedy gradient-based (LPCA-Greedy) strategies, balancing immediate rewards against costs. Empirical results show LPCA variants consistently outperform DDPG with OptLayer and align closely with Whittle-index policies, particularly as the number of projects and resource constraints increase, highlighting robustness and scalability for continuous-action, resource-constrained MDPs.
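The relaxation step can be written out as a short worked formulation. This is a sketch of the standard Lagrangian relaxation for weakly coupled MDPs, not a transcription of the paper's equations. The symbols $N$, $r_i$, $c_i$, $\gamma$ and $V_i$, and the discounted-reward criterion, are assumptions introduced here. Only $Q_i(s_i,a_i,\lambda)$, $\lambda^*$ and $B$ come from the summary above.

$$
\max_{\pi}\ \mathbb{E}\Big[\sum_{t\ge 0}\gamma^{t}\sum_{i=1}^{N} r_i(s_i^t,a_i^t)\Big]
\quad\text{s.t.}\quad \sum_{i=1}^{N} c_i(a_i^t)\le B \ \ \text{for all } t .
$$

Pricing the constraint with a multiplier $\lambda \ge 0$ gives each project its own problem with reward $r_i - \lambda c_i$:

$$
V_i(s_i,\lambda)=\max_{a_i} Q_i(s_i,a_i,\lambda),
\qquad
L(s,\lambda)=\sum_{i=1}^{N} V_i(s_i,\lambda)+\frac{\lambda B}{1-\gamma},
\qquad
\lambda^{*}=\arg\min_{\lambda\ge 0} L(s,\lambda).
$$

Each $V_i(s_i,\lambda)$ is a maximum over policies of functions that are affine in $\lambda$, so it is convex in $\lambda$. The sum $L(s,\lambda)$ is therefore convex too, which is why the search for $\lambda^*$ reduces to a one-dimensional convex problem.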

Abstract

This paper introduces the Lagrange Policy for Continuous Actions (LPCA), a reinforcement learning algorithm specifically designed for weakly coupled MDP problems with continuous action spaces. LPCA addresses the challenge of resource constraints dependent on continuous actions by introducing a Lagrange relaxation of the weakly coupled MDP problem within a neural network framework for Q-value computation. This approach effectively decouples the MDP, enabling efficient policy learning in resource-constrained environments. We present two variations of LPCA: LPCA-DE, which utilizes differential evolution for global optimization, and LPCA-Greedy, a method that incrementally and greedily selects actions based on Q-value gradients. Comparative analysis against other state-of-the-art techniques across various settings highlights LPCA's robustness and efficiency in managing resource allocation while maximizing rewards.
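The greedy variant described above can be illustrated with a minimal sketch. It assumes that "incrementally and greedily" means repeatedly handing a small resource increment to the project whose Q-value rises the most, with the gradient estimated by a finite difference. The function names, the step size, the per-project cap `a_max`, and the concave toy Q-functions are all hypothetical. In LPCA the Q-values would come from the trained network.

```python
import numpy as np


def greedy_allocate(q_funcs, budget, step=0.01, a_max=1.0):
    """Spend `budget` in increments of `step`, one increment at a time.

    q_funcs: one callable per project mapping an action level to a Q-value
             (hypothetical stand-ins for a trained network at a fixed state).
    """
    actions = np.zeros(len(q_funcs))
    n_steps = int(budget / step + 1e-9)  # number of increments the budget allows
    for _ in range(n_steps):
        # Finite-difference estimate of each project's Q-value gradient.
        gains = np.full(len(q_funcs), -np.inf)
        for i, q in enumerate(q_funcs):
            if actions[i] + step <= a_max + 1e-12:
                gains[i] = q(actions[i] + step) - q(actions[i])
        best = int(np.argmax(gains))
        if gains[best] <= 0:  # no project benefits from more resource
            break
        actions[best] += step
    return actions


def make_toy_q(weight):
    """Concave toy Q-function, weight * sqrt(a); an assumption for illustration."""
    return lambda a: weight * np.sqrt(a)


q_funcs = [make_toy_q(w) for w in (1.0, 2.0, 3.0)]
allocation = greedy_allocate(q_funcs, budget=1.5)
print(allocation, allocation.sum())
```

With these toy functions the whole budget is spent, and projects with larger weights receive larger shares. A smaller `step` gives a finer allocation at the cost of more Q-value evaluations.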

Paper Structure

This paper contains 9 sections, 10 equations, 3 figures, 5 algorithms.

Figures (3)

  • Figure 1: Experimental results for Type A environment: (Left) 4 projects and 2 units of resources, (Right) 6 projects and 4 units of resources.
  • Figure 2: Experimental results for Type B environment: (Left) 4 projects and 2 units of resources, (Right) 6 projects and 4 units of resources.
  • Figure 3: (Left) Speed Scaling with 4 projects and 1.5 units of resources, (Right) Mixed Type A and B environments with 6 projects and 4 units of resources.