A Robust Model-Based Approach for Continuous-Time Policy Evaluation with Unknown Lévy Process Dynamics

Qihao Ye, Xiaochuan Tian, Yuhua Zhu

Abstract

This paper develops a model-based framework for continuous-time policy evaluation (CTPE) in reinforcement learning, incorporating both Brownian and Lévy noise to model stochastic dynamics influenced by rare and extreme events. Our approach formulates the policy evaluation problem as solving a partial integro-differential equation (PIDE) for the value function with unknown coefficients. A key challenge in this setting is accurately recovering the unknown coefficients in the stochastic dynamics, particularly when driven by Lévy processes with heavy tail effects. To address this, we propose a robust numerical approach that effectively handles both unbiased and censored trajectory datasets. This method combines maximum likelihood estimation with an iterative tail correction mechanism, improving the stability and accuracy of coefficient recovery. Additionally, we establish a theoretical bound for the policy evaluation error based on coefficient recovery error. Through numerical experiments, we demonstrate the effectiveness and robustness of our method in recovering heavy-tailed Lévy dynamics and verify the theoretical error analysis in policy evaluation.
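To make the setting concrete, the following minimal sketch simulates trajectories driven by both Brownian and heavy-tailed Lévy noise, then filters out trajectories with extreme excursions. It is an illustration under stated assumptions, not the authors' implementation: the SDE form, the Euler-type discretization, the symmetric stable noise, and every name (`simulate_paths`, `censor_paths`, `sigma_brownian`, `sigma_levy`, `stable_index`, `bound`) are hypothetical. In particular, `stable_index` is the standard stability index in (0, 2] and may not match the paper's parameterization of $\alpha$, and the censoring rule is only one possible reading of "censored trajectory data".

```python
import numpy as np


def sample_symmetric_stable(stable_index, size, rng):
    """Chambers-Mallows-Stuck sampler for standard symmetric stable variates.

    stable_index = 2 gives a Gaussian with variance 2; stable_index = 1 gives
    a standard Cauchy variate.
    """
    a = stable_index
    u = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    w = rng.exponential(1.0, size)                # unit-mean exponential
    return (np.sin(a * u) / np.cos(u) ** (1.0 / a)
            * (np.cos((1.0 - a) * u) / w) ** ((1.0 - a) / a))


def simulate_paths(x0, drift, sigma_brownian, sigma_levy, stable_index,
                   dt, n_steps, n_paths, seed=0):
    """Euler-type scheme for an assumed SDE with Brownian and stable noise."""
    rng = np.random.default_rng(seed)
    x = np.empty((n_paths, n_steps + 1))
    x[:, 0] = x0
    for k in range(n_steps):
        # Brownian increments scale like dt**(1/2).
        dw = np.sqrt(dt) * rng.standard_normal(n_paths)
        # Stable increments scale like dt**(1/stable_index) by self-similarity.
        dl = dt ** (1.0 / stable_index) * sample_symmetric_stable(
            stable_index, n_paths, rng)
        x[:, k + 1] = (x[:, k] + drift(x[:, k]) * dt
                       + sigma_brownian * dw + sigma_levy * dl)
    return x


def censor_paths(paths, bound):
    """Keep only trajectories that stay inside [-bound, bound] (assumed rule)."""
    keep = np.all(np.abs(paths) <= bound, axis=1)
    return paths[keep]


# Usage: mean-reverting drift, moderately heavy tails.
paths = simulate_paths(x0=0.0, drift=lambda x: -x, sigma_brownian=1.0,
                       sigma_levy=0.5, stable_index=1.5, dt=0.01,
                       n_steps=100, n_paths=1000)
censored = censor_paths(paths, bound=3.0)
print(paths.shape, censored.shape)
```

Dropping trajectories with large excursions removes exactly the samples that carry information about the tail, which is one way to see why an estimator fitted to such data can be stable yet biased.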

Paper Structure

This paper contains 24 sections, 6 theorems, 49 equations, 9 figures, 8 algorithms.

Key Result

Lemma 2.1

Given the probability density function of the stochastic process $X_t$ governed by the fractional Fokker-Planck equation (eq:FFPE_condition_on_x_0), the corresponding value function defined in (eq:definition_of_the_value_function) satisfies

Figures (9)

  • Figure 1: Comparison of different approaches in estimating $D_{\textnormal{f}}(x)$ across 12 independent tests for each case. Left panel: results using unbiased trajectory data, showing instability in recovery with a large number of outliers. Right panel: results using censored trajectory data with filtered tails, leading to more stable recovery but introducing significant bias. Middle panel: results incorporating the tail correction technique, which improves the accuracy and robustness of coefficient recovery. Further details on the numerical tests can be found in the numerical experiments section (sec:numerical_experiments).
  • Figure 2: Contour plots showing the objective landscape for varying TCF values, with $D_{\textnormal{o}}$ on the $x$-axis and $D_{\textnormal{f}}$ on the $y$-axis. The white "$\times$" denotes the ground truth $(4, 3)$, while the red dot represents the minimizer of the objective for each respective TCF value. The observed shifts in the minimizer across different TCF values highlight the critical role of TCF selection in estimation accuracy. This corresponds to a censored trajectory dataset obtained by an MCMC sampler.
  • Figure 3: The left and right graphs show how adaptively adjusting the TCF improves the accuracy of recovering the coefficients $D_{\textnormal{f}}$ and $D_{\textnormal{o}}$. The central diagram illustrates the process of adaptively applying the tail correction technique. This corresponds to a censored trajectory dataset obtained by an MCMC sampler. In practice, we update the TCF in each step to enhance the overall efficiency of the process, as outlined in the algorithm labeled alg:Adam_with_tail_correction_concise.
  • Figure 4: Relative errors of $b$ (left panel), $D_{\textnormal{o}}$ (middle panel), $D_{\textnormal{f}}$ (right panel) versus the number of trajectories for $\alpha = 0.3$ (deep color) and $\alpha = 0.6$ (light color). Related to the example labeled eg:unbiased_constant.
  • Figure 5: Left panel: Relative error (after removing the outliers) of $b, D_{\textnormal{o}}, D_{\textnormal{f}}$ versus the number of trajectories for $\alpha = 0.3$. Right panel: Recovery results for 100,000 trajectories for $\alpha = 0.3$, comparing predicted variables with ground truth. Related to the example labeled eg:unbiased_variable.
  • ...and 4 more figures

Theorems & Definitions (16)

  • Lemma 2.1
  • Theorem 3.1: Policy Evaluation Error
  • Example 4.1
  • Example 4.2
  • Example 4.3
  • Example 4.4
  • Example 4.5
  • Definition B.1
  • Lemma B.2
  • Lemma B.3
  • ...and 6 more