Actively Learning Reinforcement Learning: A Stochastic Optimal Control Approach
Mohammad S. Ramadan, Mahmoud A. Hayajnh, Michael T. Tolley, Kyriakos G. Vamvoudakis
TL;DR
This work addresses RL's vulnerability to model mismatch and safety concerns by embedding active information gathering into a stochastic optimal control (SOC) framework. It replaces the intractable dynamic programming (DP) recursion over the full information state with an extended Kalman filter (EKF)-based wide-sense approximation, and optimizes a policy via Deep Deterministic Policy Gradient (DDPG) on the reduced state. The resulting agent autonomously balances caution (safety under uncertainty) and probing (information gathering). Key contributions include (i) an SOC-informed RL formulation with information-state tracking and an SOC-aware objective, (ii) an EKF-based tractable approximation of the information state, and (iii) an SOC-adapted DDPG algorithm with the gradient expression $\nabla_\theta J_\theta = \mathbb{E}[ \nabla_u \mathcal{Q}(\hat{\pi},u) \nabla_\theta \mu_\theta(\hat{\pi}) ]$. A numerical example shows that the proposed method stabilizes a nonlinear, partially observed system and avoids the divergence observed with certainty-equivalence LQG, while achieving acceptable performance at reasonable computational cost, highlighting its potential for safer, data-efficient control in real-world RL tasks.
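To make the two ingredients above concrete, here is a minimal sketch, not the authors' implementation. It assumes the reduced information state $\hat{\pi}$ is the EKF mean stacked with the upper triangle of the EKF covariance, and it uses a hypothetical linear policy $\mu_\theta(\hat{\pi}) = \theta^\top \hat{\pi}$ and a hypothetical quadratic critic in place of the neural networks DDPG would use. The toy dynamics, noise levels, and all function names (`ekf_step`, `info_state`, `q_fn`) are illustrative assumptions.

```python
import numpy as np

DT = 0.1  # assumed discretization step for the toy system

def f(x, u):
    # Toy pendulum-like dynamics (an illustrative assumption, not the paper's system).
    return np.array([x[0] + DT * x[1], x[1] + DT * (-np.sin(x[0]) + u)])

def f_jac(x, u):
    # Jacobian of f with respect to the state x.
    return np.array([[1.0, DT], [-DT * np.cos(x[0]), 1.0]])

def h(x):
    # Only the first state component is measured.
    return np.array([x[0]])

H = np.array([[1.0, 0.0]])  # Jacobian of h (constant for this toy sensor)

def ekf_step(x_hat, P, u, y, Qw, Rv):
    """One EKF predict/update cycle; returns the new mean and covariance."""
    x_pred = f(x_hat, u)
    F = f_jac(x_hat, u)
    P_pred = F @ P @ F.T + Qw
    S = H @ P_pred @ H.T + Rv                # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)      # Kalman gain
    x_new = x_pred + K @ (y - h(x_pred))
    P_new = (np.eye(len(x_hat)) - K @ H) @ P_pred
    return x_new, 0.5 * (P_new + P_new.T)    # symmetrize against round-off

def info_state(x_hat, P):
    """Stack the mean and the upper triangle of P into one vector."""
    return np.concatenate([x_hat, P[np.triu_indices(P.shape[0])]])

# One filter step from an assumed prior.
x_hat, P = np.array([0.5, 0.0]), 0.1 * np.eye(2)
x_new, P_new = ekf_step(x_hat, P, u=0.2, y=np.array([0.55]),
                        Qw=0.01 * np.eye(2), Rv=np.array([[0.04]]))
pi_hat = info_state(x_new, P_new)

# Hypothetical critic Q(pi, u) = -(u - c . pi)^2 and linear policy mu = theta . pi.
c = np.linspace(-1.0, 1.0, pi_hat.size)
theta = np.full(pi_hat.size, 0.1)

def q_fn(pi, u):
    return -(u - c @ pi) ** 2

# Single-sample deterministic policy gradient: grad_u Q * grad_theta mu.
u_pol = theta @ pi_hat
grad_u_q = -2.0 * (u_pol - c @ pi_hat)
dpg_grad = grad_u_q * pi_hat                 # grad_theta mu_theta(pi) = pi

# Finite-difference check of d/dtheta Q(pi, mu_theta(pi)).
eps = 1e-6
fd_grad = np.array([
    (q_fn(pi_hat, (theta + eps * e) @ pi_hat)
     - q_fn(pi_hat, (theta - eps * e) @ pi_hat)) / (2 * eps)
    for e in np.eye(theta.size)
])
```

Because the covariance entries are part of `pi_hat`, a policy trained on this vector can in principle react to uncertainty itself, which is the mechanism behind caution and probing; in the full method the expectation in the gradient would be estimated over sampled transitions rather than the single sample used here.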
Abstract
In this paper, we propose a framework for achieving two intertwined objectives: (i) equipping reinforcement learning with active exploration and deliberate information gathering, so that it regulates the state and parameter uncertainties resulting from modeling mismatches and noisy sensing; and (ii) overcoming the computational intractability of stochastic optimal control. We approach both objectives by using reinforcement learning to compute the stochastic optimal control law. On the one hand, we avoid the curse of dimensionality that prohibits the direct solution of the stochastic dynamic programming equation. On the other hand, the resulting stochastic optimal control reinforcement learning agent exhibits caution and probing, that is, optimal online exploration and exploitation. Unlike a fixed exploration-exploitation balance, caution and probing are employed automatically by the controller in real time, even after the learning process has terminated. We conclude the paper with a numerical simulation illustrating how a Linear Quadratic Regulator under the certainty equivalence assumption may lead to poor performance and filter divergence, whereas our proposed approach is stabilizing, achieves acceptable performance, and is computationally convenient.
