Continuous-time Risk-sensitive Reinforcement Learning via Quadratic Variation Penalty

Yanwei Jia

TL;DR

This work develops a continuous-time risk-sensitive reinforcement learning framework based on an exponential utility and entropy regularization, reframing risk sensitivity through a quadratic variation penalty that appears in a martingale characterization. A risk-sensitive q-function is defined and learned via martingale conditions, enabling q-learning to handle both finite-horizon and ergodic settings without resorting to nonlinear exponential Bellman equations. The theory is instantiated in Merton's investment problem with power utility, where convergence results and the role of the temperature parameter are analyzed, and in off-policy linear-quadratic control, where empirical evidence shows improved finite-sample performance for appropriate risk-sensitivity levels. Overall, the approach provides a tractable, principled pathway to incorporate risk sensitivity and robustness into continuous-time RL, with clear guidance on algorithmic design and parameter schedules.

Abstract

This paper studies continuous-time risk-sensitive reinforcement learning (RL) under the entropy-regularized, exploratory diffusion process formulation with an exponential-form objective. The risk-sensitive objective arises either as the agent's risk attitude or as a distributionally robust approach against model uncertainty. Owing to the martingale perspective in Jia and Zhou (2023), the risk-sensitive RL problem is shown to be equivalent to ensuring the martingale property of a process involving both the value function and the q-function, augmented by an additional penalty term: the quadratic variation of the value process, which captures the variability of the value-to-go along the trajectory. This characterization allows existing RL algorithms developed for non-risk-sensitive scenarios to be adapted to incorporate risk sensitivity by adding the realized variance of the value process. Additionally, I highlight that the conventional policy gradient representation is inadequate for risk-sensitive problems due to the nonlinear nature of quadratic variation; however, q-learning offers a solution and extends to infinite-horizon settings. Finally, I prove the convergence of the proposed algorithm for Merton's investment problem and quantify the impact of the temperature parameter on the behavior of the learning procedure. I also conduct simulation experiments to demonstrate how risk-sensitive RL improves the finite-sample performance in the linear-quadratic control problem.
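As a rough illustration of "adding the realized variance of the value process", the sketch below computes the realized quadratic variation of a discretized value path and adds it as a penalty to the usual per-step increment. The function names, the time grid, and the exact form of the increment (which omits the q-function and entropy terms and uses a weight of `eps / 2`) are assumptions made for illustration, not the paper's algorithm.

```python
import numpy as np


def realized_qv(values):
    """Realized quadratic variation: sum of squared increments of a sampled path."""
    increments = np.diff(np.asarray(values, dtype=float))
    return float(np.sum(increments ** 2))


def penalized_increments(values, rewards, dt, eps):
    """Per-step increments dV + r*dt + (eps/2)*(dV)^2 on a uniform time grid.

    Hypothetical simplification: the q-function and entropy terms that appear
    in the paper's martingale condition are left out, and the sign and weight
    of the penalty are assumed. With eps = 0 this reduces to the
    non-risk-sensitive increment dV + r*dt.
    """
    v = np.asarray(values, dtype=float)
    r = np.asarray(rewards, dtype=float)
    dv = np.diff(v)
    # rewards are aligned with the left endpoint of each time step
    return dv + r[:-1] * dt + 0.5 * eps * dv ** 2


if __name__ == "__main__":
    path = [0.0, 1.0, 3.0]
    print(realized_qv(path))
    print(penalized_increments(path, [1.0, 1.0, 1.0], dt=0.5, eps=2.0))
```

Setting `eps` to zero recovers the unpenalized increments, which is the sense in which an existing non-risk-sensitive update could be reused with one extra term.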

Paper Structure

This paper contains 35 sections, 10 theorems, 94 equations, 2 figures, 1 table, 3 algorithms.

Key Result

Lemma 1

Suppose that there is a continuous function $V^*(t,x;\epsilon)$ satisfying $V^*(T,x;\epsilon) = h(x)$ and $\mathbb{E}\left[ e^{\frac{\epsilon^2}{2} \langle {V^*}^{\bm a}\rangle(T) } \right] < \infty$ for any admissible $\bm a$. If there exists a control $\bm a^*$ such that for any initial condition … where $\langle {V^*}^{\bm a} \rangle$ is the quadratic variation (QV) of the process $V^{*}(s, X_s^{\bm a};\epsilon)$. …

Figures (2)

  • Figure 1: Illustration of the learned policy. The left panel shows the convergence of the mean squared error of the learned policy parameters, and the right panel shows the mean equivalent relative wealth loss of the learned deterministic policy. Both panels use log scales. The results are based on simulated data with 1000 runs. The shaded area indicates twice the standard deviation of the estimated expectation. The temperature parameter is $\lambda=3$, and the learning rate is $a_{\psi, n} = (1+n)^{-1}$. The number of episodes in each simulation run is $10^5$.
  • Figure 2: Illustration of the learned policy. The left panel shows the convergence of the mean squared error of the learned policy parameters, and the right panel shows the mean equivalent relative wealth loss of the learned deterministic policy. Both panels use log scales. The results are based on simulated data with 1000 runs. The shaded area indicates twice the standard deviation of the estimated expectation. The temperature parameter is $\lambda_n=3 (n+1)^{-1/2}$, and the learning rate is $a_{\psi, n} = (n+1)^{-1/2}$. The number of episodes in each simulation run is $10^5$.
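
The two captions differ only in their temperature and learning-rate schedules. A minimal sketch of those schedules follows; the function names and the `power` argument are hypothetical conveniences, and only the formulas themselves come from the captions.

```python
def learning_rate(n, power=0.5):
    """Learning rate a_{psi,n} = (n + 1)^(-power).

    power=1.0 gives the Figure 1 schedule, power=0.5 the Figure 2 schedule.
    """
    return (n + 1) ** (-power)


def temperature(n, lam0=3.0, power=0.5):
    """Temperature lambda_n = lam0 * (n + 1)^(-power).

    power=0.0 gives the constant lambda = 3 of Figure 1,
    power=0.5 gives the decaying schedule of Figure 2.
    """
    return lam0 * (n + 1) ** (-power)


if __name__ == "__main__":
    for n in (0, 3, 99):
        print(n, learning_rate(n, power=1.0), temperature(n, power=0.0))
        print(n, learning_rate(n), temperature(n))
```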

Theorems & Definitions (21)

  • Lemma 1
  • Definition 1
  • Theorem 1
  • Definition 2
  • Proposition 1
  • Theorem 2
  • Example 1
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • ...and 11 more