Research on Optimal Control Problem Based on Reinforcement Learning under Knightian Uncertainty
Ziyu Li, Chen Fei, Weiyin Fei
TL;DR
This work develops a unified framework for continuous-time reinforcement learning under Knightian uncertainty by integrating sublinear (nonlinear) expectation theory with entropy-regularized relaxed stochastic control. It derives a G-HJB equation and characterizes the optimal randomized control. In the linear-quadratic setting, it proves that the optimal policy is Gaussian with a variance that depends on the Knightian uncertainty bounds, establishes a solvability equivalence between the exploratory and non-exploratory problems, and obtains an explicit exploration cost of $\mathcal{C}^{u^*,\theta^*}(x) = \frac{\lambda}{2\rho}$. The paper also proves a vanishing-exploration property: as $\lambda \to 0$, the Gaussian policy converges to the deterministic optimal control and the exploratory value function converges to its non-exploratory counterpart. A numerical LQ example based on an indoor-temperature-control scenario validates the theoretical predictions, showing how the discount rate $\rho$ and the uncertainty bounds shape the optimal policy and its convergence behavior, with practical implications for designing robust RL algorithms under model uncertainty.
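The exploration cost $\frac{\lambda}{2\rho}$ and the vanishing-exploration property can be illustrated with a minimal Python sketch. Only the cost formula comes from the summary above. The helper `assumed_policy_std` is hypothetical: it assumes the policy variance is proportional to $\lambda$ with a placeholder factor `scale`, standing in for the paper's uncertainty-dependent expression, which is not reproduced here.

```python
import math


def exploration_cost(lam: float, rho: float) -> float:
    """Exploration cost lambda / (2 * rho), as quoted in the summary.

    lam is the exploration (entropy) weight, rho the discount rate.
    """
    if lam < 0.0 or rho <= 0.0:
        raise ValueError("need lam >= 0 and rho > 0")
    return lam / (2.0 * rho)


def assumed_policy_std(lam: float, scale: float = 1.0) -> float:
    """Standard deviation of the Gaussian policy under an ASSUMED form.

    ASSUMPTION: variance = lam * scale. The true factor in the paper depends
    on the Knightian uncertainty bounds; 'scale' is only a placeholder.
    """
    return math.sqrt(lam * scale)


def sweep(lams, rho):
    """Return (lam, cost, assumed std) triples for a list of lam values."""
    return [(lam, exploration_cost(lam, rho), assumed_policy_std(lam))
            for lam in lams]


if __name__ == "__main__":
    # As lam shrinks, both the cost and the assumed policy spread go to zero.
    for lam, cost, std in sweep([1.0, 0.1, 0.01, 0.0], rho=0.5):
        print(f"lambda={lam:<5} cost={cost:.4f} assumed_std={std:.4f}")
```

The cost does not depend on the state $x$. It shrinks linearly in $\lambda$ and grows as the discount rate $\rho$ decreases, so a more patient agent pays more in total for the same exploration weight.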
Abstract
Reinforcement learning (RL) agents often make decisions in environments subject to Knightian uncertainty. This paper formulates the exploratory state dynamics under Knightian uncertainty and studies the resulting entropy-regularized relaxed stochastic control problem. Using stochastic analysis and the dynamic programming principle under nonlinear expectation, we derive the Hamilton-Jacobi-Bellman (HJB) equation and solve for the optimal policy, which balances exploration and exploitation. For the linear-quadratic (LQ) case, we examine the agent's optimal randomized feedback control under both state-dependent and state-independent rewards and prove that it follows a Gaussian distribution. We further investigate how the degree of Knightian uncertainty affects the variance of the optimal feedback policy. We also establish the solvability equivalence between the non-exploratory and exploratory LQ problems under Knightian uncertainty and analyze the associated exploration cost. Finally, we provide an LQ example and validate the theoretical findings through numerical simulations.
