A Single-Loop Deep Actor-Critic Algorithm for Constrained Reinforcement Learning with Provable Convergence

Kexuan Wang, An Liu, Baishuo Lin

TL;DR

The paper tackles constrained reinforcement learning for continuous-control CMDPs by proposing SLDAC, a single-loop deep actor-critic framework that uses CSSCA for the actor and a limited-iteration, observation-reuse critic update to drastically reduce environment interactions. It provides theoretical guarantees, proving almost-sure convergence to a KKT point and establishing asymptotic surrogate-function consistency with finite-time convergence rates for the critic networks. The approach relies on dual critics, TD learning, and carefully designed step-size schedules to manage bias from the single-loop design. Empirically, SLDAC outperforms baselines in three CRL domains (delay-constrained power control, safe robot navigation, and constrained LQR) while achieving significantly lower interaction costs.
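
For reference, the KKT conditions mentioned above can be written in a generic form. The notation here is an assumption for illustration and may differ from the paper's: $J_{0}$ is the objective, $J_{i}$ are the constraint functions with thresholds $c_{i}$, and $\boldsymbol{\theta}$ are the policy parameters.

$$
\min_{\boldsymbol{\theta}} \; J_{0}(\boldsymbol{\theta}) \quad \text{s.t.} \quad J_{i}(\boldsymbol{\theta}) \le c_{i}, \quad i=1,\dots,I .
$$

Under this formulation, a point $\boldsymbol{\theta}^{*}$ is a KKT point if there exist multipliers $\lambda_{1},\dots,\lambda_{I} \ge 0$ such that

$$
\nabla J_{0}(\boldsymbol{\theta}^{*}) + \sum_{i=1}^{I} \lambda_{i} \nabla J_{i}(\boldsymbol{\theta}^{*}) = \mathbf{0}, \qquad J_{i}(\boldsymbol{\theta}^{*}) \le c_{i}, \qquad \lambda_{i}\bigl(J_{i}(\boldsymbol{\theta}^{*}) - c_{i}\bigr) = 0, \quad i=1,\dots,I .
$$

Because the objective and constraints are non-convex, a KKT point is a stationarity guarantee rather than a guarantee of global optimality.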

Abstract

Deep Actor-Critic algorithms, which combine Actor-Critic with deep neural networks (DNNs), have been among the most prevalent reinforcement learning algorithms for decision-making problems in simulated environments. However, existing deep Actor-Critic algorithms are still not mature enough to solve realistic problems with non-convex stochastic constraints and a high cost of interacting with the environment. In this paper, we propose a single-loop deep Actor-Critic (SLDAC) algorithmic framework for general constrained reinforcement learning (CRL) problems. In the actor step, the constrained stochastic successive convex approximation (CSSCA) method is applied to handle the non-convex stochastic objective and constraints. In the critic step, the critic DNNs are updated only once or a finite number of times per iteration, which simplifies the algorithm to a single-loop framework (existing works require a sufficient number of critic updates in each iteration to ensure that the inner loop converges well enough). Moreover, the variance of the policy gradient estimation is reduced by reusing observations from the old policy. The single-loop design and the observation reuse effectively reduce the agent-environment interaction cost and computational complexity. In spite of the biased policy gradient estimation incurred by the single-loop design and observation reuse, we prove that SLDAC with a feasible initial point can converge to a Karush-Kuhn-Tucker (KKT) point of the original problem almost surely. Simulations show that the SLDAC algorithm can achieve superior performance with much lower interaction cost.
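
The single-loop structure and the observation reuse described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it assumes a toy one-state problem with two actions and a reward (not a cost), uses a tabular critic instead of DNNs, omits the constraints, and swaps the CSSCA actor step for a plain softmax policy-gradient step. All names (`critic_step`, `actor_step`, `run`) and the step-size schedules are hypothetical.

```python
import math
import random
from collections import deque

GAMMA = 0.9  # discount factor (assumed for this toy problem)


def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    z = sum(exps)
    return [e / z for e in exps]


def critic_step(q, pi, buffer, alpha):
    """One averaged TD(0) step over the buffer; there is no inner loop."""
    v = sum(p * qa for p, qa in zip(pi, q))  # value under the current policy
    step = [0.0, 0.0]
    for a, r in buffer:
        step[a] += (r + GAMMA * v - q[a]) / len(buffer)
    return [qa + alpha * s for qa, s in zip(q, step)]


def actor_step(theta, q, buffer, beta):
    """One policy-gradient step using the current, still inexact critic."""
    pi = softmax(theta)
    v = sum(p * qa for p, qa in zip(pi, q))
    grad = [0.0, 0.0]
    for a, _ in buffer:  # reused actions may come from older policies (bias)
        adv = q[a] - v
        for b in range(2):
            grad[b] += adv * ((1.0 if a == b else 0.0) - pi[b]) / len(buffer)
    return [t + beta * g for t, g in zip(theta, grad)]


def run(iterations=300, new_per_iter=4, window=32, seed=0):
    rng = random.Random(seed)
    theta, q = [0.0, 0.0], [0.0, 0.0]
    buffer = deque(maxlen=window)  # sliding window of (action, reward) pairs
    for t in range(1, iterations + 1):
        pi = softmax(theta)
        for _ in range(new_per_iter):  # only a few fresh interactions
            a = 1 if rng.random() < pi[1] else 0
            buffer.append((a, float(a)))  # toy reward: 1 for action 1
        alpha = 0.5 / t ** 0.6  # critic step size (assumed schedule)
        beta = 0.5 / t ** 0.9   # actor step size (assumed schedule)
        q = critic_step(q, pi, buffer, alpha)       # single critic update
        theta = actor_step(theta, q, buffer, beta)  # single actor update
    return theta, q, buffer
```

Calling `theta, q, buffer = run()` and then `softmax(theta)` gives the learned action probabilities. Each iteration adds only `new_per_iter` fresh observations while both updates average over the whole window, which is where the reduced interaction cost and the bias discussed in the abstract come from.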

Paper Structure

This paper contains 34 sections, 5 theorems, 80 equations, 3 figures, 2 tables, 1 algorithm.

Key Result

Lemma 1

(Convergence rate of the Critic Module) Suppose Assumption 1 and Assumption 3 hold, the width of critic DNNs is $m_{Q}$, and the radius $R_{\boldsymbol{\omega}}$ of the parameter constraint set $\mathbf{\Omega}=\mathbb{B}\left(\boldsymbol{\omega}_{0}^{i},R_{\boldsymbol{\omega}}\right)$ is specifical with almost probability 1, where $\epsilon_{m_{Q}}\triangleq O\bigl(m_{Q}^{-1/6}\sqrt{\log\cdots}\bigr)$ ...

Figures (3)

  • Figure 1: An autonomous vehicle transport environment provided by the Safety Gym.
  • Figure 2: The algorithmic framework of the proposed SLDAC.
  • Figure 3: The first and second rows of images show the learning curves of average rewards and costs, respectively, in three typical scenarios. The lines show the average performance of the algorithms, and the shaded regions indicate half the standard deviation.

Theorems & Definitions (7)

  • Remark 1
  • Remark 2
  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Lemma 3
  • Lemma 4