Actively Learning Reinforcement Learning: A Stochastic Optimal Control Approach

Mohammad S. Ramadan, Mahmoud A. Hayajnh, Michael T. Tolley, Kyriakos G. Vamvoudakis

TL;DR

This work addresses RL's vulnerability to model mismatch and safety concerns by embedding active information gathering into a stochastic optimal control (SOC) framework. Dynamic programming (DP) over the full information state is intractable, so the method replaces it with an EKF-based wide-sense approximation and optimizes a policy on the reduced state via Deep Deterministic Policy Gradient (DDPG). The resulting agent autonomously balances caution (safety under uncertainty) and probing (information gathering). Key contributions include (i) an SOC-informed RL formulation with information-state tracking and an SOC-aware objective, (ii) an EKF-based tractable approximation of the information state, and (iii) an SOC-adapted DDPG algorithm with the gradient expression $\nabla_\theta J_\theta = \mathbb{E}[ \nabla_u \mathcal{Q}(\hat{\pi},u) \nabla_\theta \mu_\theta(\hat{\pi}) ]$. A numerical example shows that the proposed method stabilizes a nonlinear partially observed system and avoids the divergence observed with certainty-equivalence LQG, while achieving acceptable performance at reasonable computational cost, highlighting its potential for safer, data-efficient control in real-world RL tasks.
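
The gradient expression above is the chain rule applied through a deterministic policy. The following is a minimal sketch of that identity, not the paper's algorithm: it assumes a hypothetical two-feature reduced information state `s = (mean, trace of covariance)`, a linear policy `mu`, and a hand-written quadratic critic `q_value` in place of a learned one. It compares the analytic gradient with a central finite-difference estimate.

```python
import random


def mu(theta, s):
    """Hypothetical linear deterministic policy: u = theta . s."""
    return sum(t * si for t, si in zip(theta, s))


def q_value(s, u, b=0.5, r=0.1):
    """Toy critic (assumed, not learned): penalizes the shifted mean,
    the control effort, and the uncertainty feature s[1]."""
    return -((s[0] + b * u) ** 2) - r * u ** 2 - s[1]


def dq_du(s, u, b=0.5, r=0.1):
    """Partial derivative of the toy critic with respect to the action."""
    return -2.0 * b * (s[0] + b * u) - 2.0 * r * u


def objective(theta, batch):
    """Sample average of Q(s, mu_theta(s)) over the batch."""
    return sum(q_value(s, mu(theta, s)) for s in batch) / len(batch)


def policy_gradient(theta, batch):
    """Sample average of dQ/du * dmu/dtheta; for a linear policy dmu/dtheta_i = s_i."""
    grad = [0.0] * len(theta)
    for s in batch:
        g_u = dq_du(s, mu(theta, s))
        for i, s_i in enumerate(s):
            grad[i] += g_u * s_i
    return [g / len(batch) for g in grad]


def finite_difference(theta, batch, eps=1e-6):
    """Central finite-difference estimate of the objective gradient."""
    grad = []
    for i in range(len(theta)):
        up, down = list(theta), list(theta)
        up[i] += eps
        down[i] -= eps
        grad.append((objective(up, batch) - objective(down, batch)) / (2 * eps))
    return grad


rng = random.Random(0)
# Each sample is (state-mean feature, covariance-trace feature).
batch = [(rng.gauss(0.0, 1.0), rng.uniform(0.1, 2.0)) for _ in range(200)]
theta = [0.3, -0.2]

analytic = policy_gradient(theta, batch)
numeric = finite_difference(theta, batch)
print("analytic:", analytic)
print("numeric: ", numeric)
```

In an actor-critic setting the critic would itself be learned, so `dq_du` would come from automatic differentiation rather than a closed form; the chain-rule structure stays the same.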

Abstract

In this paper we propose a framework towards achieving two intertwined objectives: (i) equipping reinforcement learning with active exploration and deliberate information gathering, such that it regulates state and parameter uncertainties resulting from modeling mismatches and noisy sensing; and (ii) overcoming the computational intractability of stochastic optimal control. We approach both objectives by using reinforcement learning to compute the stochastic optimal control law. On one hand, we avoid the curse of dimensionality that prohibits the direct solution of the stochastic dynamic programming equation. On the other hand, the resulting stochastic optimal control reinforcement learning agent admits caution and probing, that is, optimal online exploration and exploitation. Unlike a fixed balance between exploration and exploitation, caution and probing are employed automatically by the controller in real time, even after the learning process has terminated. We conclude the paper with a numerical simulation illustrating how a Linear Quadratic Regulator with the certainty equivalence assumption may lead to poor performance and filter divergence, while our proposed approach is stabilizing, achieves acceptable performance, and is computationally convenient.

Paper Structure

This paper contains 12 sections, 4 theorems, 23 equations, 3 figures, 1 algorithm.

Key Result

Lemma 1

(Smoothing theorem [resnick2019probability]): Let $(\Omega, \mathcal{A}, \mathbf{P})$ be a probability space and $\chi:\Omega \to \mathbb{R}$ a measurable $L^1$ function, i.e., $\mathbb{E}_{\mathbf{P}} |\chi| < \infty$, where $\mathbb{E}_{\mathbf{P}}$ is the expectation corresponding to …

Figures (3)

  • Figure 1: A graphical overview of stochastic optimal control: the controller is concerned not only with regulating the state estimate (mean, mode, etc.) but also with regulating the state uncertainty (or the quality of the state estimate) by driving the system through highly observable regions, for instance, regions of better signal-to-noise ratio (SNR) and/or of more or better sensors. The green and pink regions correspond to state uncertainty propagation along two different trajectories. The trajectory in pink, resembling a trajectory under stochastic optimal control, accounts for uncertainty regulation and hence, on its way to the origin, takes the path of higher observability.
  • Figure 2: The average reward over $50$ different runs of Algorithm 1 is shown in dark blue, while the shaded area is the corresponding two standard deviations about the average. In orange is the run with the highest terminal cumulative reward, whose controller is used to generate the closed-loop results below.
  • Figure 3: LQG (upper) vs. RL dual control (lower): In each panel, the vertical axis is the magnitude of the mean $\hat{x}_{k \mid k}(1)$ and of $\operatorname{tr}(\Sigma_{k \mid k})$, shown in dark blue and orange, respectively, and of the true state $x_k(1)$, shown in green.
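
The idea in Figure 1, that the control can steer the system toward regions where measurements are more informative, can be sketched with a scalar extended Kalman filter. The model below is an illustrative assumption and not the paper's example: dynamics $x_{k+1} = a x_k + u_k + w_k$ and measurement $y_k = x_k^2/2 + v_k$, so the measurement Jacobian equals the predicted state and vanishes at the origin. The function name `ekf_step` and all parameter values are hypothetical.

```python
def ekf_step(x_hat, sigma, u, y, a=0.9, q=0.05, r=0.1):
    """One predict/update cycle of a scalar EKF for the assumed model
    x+ = a*x + u + w,  y = 0.5*x**2 + v,  Var(w) = q,  Var(v) = r."""
    # Prediction through the (linear) dynamics.
    x_pred = a * x_hat + u
    sigma_pred = a * sigma * a + q
    # Measurement Jacobian of h(x) = 0.5*x**2 at the predicted mean.
    h_jac = x_pred
    innovation_var = h_jac * sigma_pred * h_jac + r
    gain = sigma_pred * h_jac / innovation_var
    # Measurement update of the mean and variance.
    x_new = x_pred + gain * (y - 0.5 * x_pred ** 2)
    sigma_new = (1.0 - gain * h_jac) * sigma_pred
    return x_new, sigma_new


# Same prior, two controls: u = 0 keeps the prediction at the origin, where
# the Jacobian is zero; u = 2 moves it to a region with a nonzero Jacobian.
# The measurements are chosen to match the predicted output in both cases.
_, sigma_near = ekf_step(x_hat=0.0, sigma=1.0, u=0.0, y=0.0)
_, sigma_far = ekf_step(x_hat=0.0, sigma=1.0, u=2.0, y=2.0)
print("posterior variance near origin:", sigma_near)
print("posterior variance after probing:", sigma_far)
```

In this toy model the posterior variance depends on the control through the Jacobian, which is the mechanism a certainty-equivalence controller ignores and a dual controller can exploit.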

Theorems & Definitions (8)

  • Remark 1
  • Lemma 1
  • Corollary
  • Corollary
  • proof
  • Theorem
  • proof
  • Remark 2