Impact of Computation in Integral Reinforcement Learning for Continuous-Time Control

Wenhan Cao; Wei Pan

TL;DR

The paper investigates how the numerical computation of the PEV integral in Integral Reinforcement Learning for continuous-time control affects policy iteration. By linking PI to Newton's method on the HJB equation, it shows that quadrature-induced errors propagate as a bounded extra term, influencing convergence. It quantifies the quadrature error within an RKHS framework, proving that Bayesian quadrature with a Matérn kernel achieves $O(N^{-b})$ convergence while the trapezoidal rule yields $O(N^{-2})$ under suitable smoothness, with Wiener kernel baselines discussed. Experimental results on linear and nonlinear control tasks validate the theoretical rates and demonstrate the practical impact of quadrature choice on learned controllers.

Abstract

Integral reinforcement learning (IntRL) demands the precise computation of the utility function's integral at its policy evaluation (PEV) stage. This is achieved through quadrature rules, which are weighted sums of utility function values evaluated at state samples obtained in discrete time. Our research reveals a critical yet underexplored phenomenon: the choice of the computational method, in this case the quadrature rule, can significantly impact control performance. This impact is traced back to the fact that computational errors introduced in the PEV stage can affect the policy iteration's convergence behavior, which in turn affects the learned controller. To elucidate how computation impacts control, we draw a parallel between IntRL's policy iteration and Newton's method applied to the Hamilton-Jacobi-Bellman equation. In this light, computational error in PEV manifests as an extra error term in each iteration of Newton's method, with its upper bound proportional to the computational error. Further, we demonstrate that when the utility function resides in a reproducing kernel Hilbert space (RKHS), the optimal quadrature is achievable by employing Bayesian quadrature with the RKHS-inducing kernel function. We prove that the local convergence rates for IntRL using the trapezoidal rule and Bayesian quadrature with a Matérn kernel are $O(N^{-2})$ and $O(N^{-b})$, respectively, where $N$ is the number of evenly-spaced samples and $b$ is the Matérn kernel's smoothness parameter. These theoretical findings are finally validated by two canonical control tasks.
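To make the "weighted sum" view of a quadrature rule concrete, the following minimal sketch writes the trapezoidal rule as a weight vector applied to evenly-spaced samples and checks its $O(N^{-2})$ error decay numerically. The integrand `np.exp`, the horizon of 1, and all function names are illustrative assumptions, not taken from the paper; in IntRL the integrand would be the utility along a state trajectory.

```python
import numpy as np


def trapezoid_weights(n_samples: int, horizon: float) -> np.ndarray:
    """Composite trapezoidal weights for evenly spaced samples on [0, horizon]."""
    h = horizon / (n_samples - 1)
    weights = np.full(n_samples, h)
    weights[0] = weights[-1] = h / 2.0  # end points get half weight
    return weights


def quadrature(values: np.ndarray, weights: np.ndarray) -> float:
    """A quadrature rule is a weighted sum of integrand samples."""
    return float(np.dot(weights, values))


def trapezoid_error(n_samples: int, horizon: float = 1.0) -> float:
    """Absolute error for the stand-in integrand exp(t) (an assumption)."""
    t = np.linspace(0.0, horizon, n_samples)
    approx = quadrature(np.exp(t), trapezoid_weights(n_samples, horizon))
    exact = np.exp(horizon) - 1.0
    return abs(approx - exact)


# Halving the step size (10 -> 20 -> 40 intervals) should cut the error
# by roughly a factor of four if the rate is O(N^-2).
errors = {n: trapezoid_error(n) for n in (11, 21, 41)}
ratios = [errors[11] / errors[21], errors[21] / errors[41]]
print(errors, ratios)
```

A smooth non-periodic integrand is used on purpose: for periodic integrands the trapezoidal rule can converge much faster, which would hide the second-order rate.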

Paper Structure (20 sections, 10 theorems, 53 equations, 13 figures)

This paper contains 20 sections, 10 theorems, 53 equations, 13 figures.

Introduction
Problem Formulation
Theoretical Analysis
Convergence Analysis of PI
Computational Error Quantification
Convergence rate of IntRL For Different Quadrature Rules
Experimental Results
Conclusion and Discussion
Motivating Example for Known Internal Dynamics
Definition of Admissible Policies
CT Bellman Equation for Known Internal Dynamics
Proof of Lemma 1 (PI as Newton's Method)
Proof of the Theorem on Convergence of PI
Details for Matérn Kernel
Illustration of BQ
...and 5 more sections

Key Result

Theorem 1

A unique, positive definite, and continuous function $V^*$ serves as the solution to the HJB equation. In this case, $V^*$ is the optimal value function, and the optimal control policy can be represented as $u^*(x) = -\frac{1}{2}R^{-1}g(x)^{\top}\nabla_x V^*(x)$.
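As a quick check of the policy formula, consider the linear-quadratic special case. The constant input matrix $g(x) = B$ and the quadratic value function $V^*(x) = x^{\top} P x$ with symmetric $P$ are illustrative assumptions here, not part of the theorem statement:

$$\nabla_x V^*(x) = 2Px, \qquad u^*(x) = -\frac{1}{2}R^{-1}B^{\top}(2Px) = -R^{-1}B^{\top}Px.$$

Under these assumptions the general expression reduces to the familiar linear state-feedback form, with the factor $\frac{1}{2}$ cancelling against the gradient of the quadratic.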

Figures (13)

Figure 1: Evaluation of accumulated cost $J$ for controllers, computed through the trapezoidal rule and BQ with a Matérn kernel, compared to the optimal controller cost $J^*$. This simulation is performed on different sample sizes $N$ and relies on evenly-spaced samples within a canonical control task from [vrabie2009neural].
Figure 2: Relationship between PI and Newton's method. The standard PI can be regarded as performing Newton's method to solve the HJB equation, while PI incorporated by the computational error can be seen as Newton's method with bounded error.
Figure 3: Flowchart illustrating the quantification of computational error. The computational error is defined as the absolute difference between the true integral and its approximation derived from the quadrature rule. The computational error is bounded by the product of the integrand’s norm in the RKHS, and the worst-case error. When employed as the quadrature rule, BQ minimizes the worst-case error, which coincides precisely with BQ’s posterior covariance.
Figure 4: Simulations for Example 1. The convergence rates of the learned parameters $\hat{\omega}^{(\infty)}$ solved by the trapezoidal rule and BQ with Matérn Kernel ($b=4$) are shown to be $O(N^{-2})$ and $O(N^{-4})$.
Figure 5: Simulations for Example 2. The convergence rates of the learned parameters $\hat{\omega}^{(\infty)}$ solved by the trapezoidal rule and BQ with Matérn Kernel ($b=4$) are shown to be $O(N^{-2})$ and $O(N^{-4})$.
...and 8 more figures
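The BQ quantities mentioned in the Figure 3 caption can be sketched for one simple kernel. The example below assumes the Wiener kernel $k(s,t) = \min(s,t)$ on $[0, T]$, a zero prior mean, and evenly-spaced nodes in $(0, T]$; the function names are hypothetical and this is not the paper's implementation. It computes the BQ weights and the posterior variance of the integral estimate.

```python
import numpy as np


def wiener_bq_weights(nodes: np.ndarray, horizon: float) -> np.ndarray:
    """BQ weights for integrating over [0, horizon] with k(s, t) = min(s, t)."""
    gram = np.minimum.outer(nodes, nodes)            # K[i, j] = min(t_i, t_j)
    kernel_mean = nodes * horizon - 0.5 * nodes**2   # z[i] = int_0^T min(t, t_i) dt
    return np.linalg.solve(gram, kernel_mean)        # w = K^{-1} z


def wiener_bq_variance(nodes: np.ndarray, horizon: float) -> float:
    """Posterior variance of the integral: prior variance minus z^T K^{-1} z."""
    kernel_mean = nodes * horizon - 0.5 * nodes**2
    prior_variance = horizon**3 / 3.0                # double integral of min(s, t)
    weights = wiener_bq_weights(nodes, horizon)
    return float(prior_variance - kernel_mean @ weights)


T, N = 1.0, 10
h = T / N
nodes = np.linspace(h, T, N)     # t = 0 is excluded: the kernel pins f(0) = 0 there
weights = wiener_bq_weights(nodes, T)
variance = wiener_bq_variance(nodes, T)
print(weights, variance)
```

With this kernel the weights come out as $h$ at every node except the last, which gets $h/2$. That is the trapezoidal rule for an integrand that vanishes at $t = 0$, which gives some intuition for why the Wiener kernel serves as a natural baseline next to the trapezoidal rule.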

Theorems & Definitions (16)

Theorem 1: HJB Equation Properties [vrabie2009neural]
Lemma 1: PI as Newton’s Method in Banach Space
Theorem 2: Convergence of Newton's Method with Bounded Error
Theorem 3: Convergence Rate of IntRL Concerning Computational Error
Corollary 1: Convergence Rate of IntRL for Trapezoidal Rule and BQ with Matérn Kernel
Definition 1: Admissible Policies [vrabie2009neural]
proof
Lemma 2: Convergence of the Standard Newton's Method [ostrowski2016solution, lancaster1966error]
Proposition 1: Iteration Error for Newton's Method
proof
...and 6 more
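The idea behind Lemma 1 and Theorem 2 can be illustrated on a scalar linear-quadratic problem, where policy iteration has a closed form. The sketch below assumes hypothetical dynamics $\dot{x} = ax + bu$ and cost $\int (qx^2 + ru^2)\,dt$, and replaces the integral-based PEV with the exact solution of the closed-loop Lyapunov equation. A constant offset `pev_error` stands in for the quadrature error. It is a toy analogue, not the paper's algorithm or experiments.

```python
import math

# Hypothetical scalar system and cost weights (assumptions for illustration).
A, B, Q, R = -1.0, 1.0, 1.0, 1.0

# Positive root of the scalar Riccati equation 2*A*p - (B*p)**2/R + Q = 0.
P_STAR = (A + math.sqrt(A**2 + B**2 * Q / R)) * R / B**2


def riccati_residual(p: float) -> float:
    return 2.0 * A * p - (B * p) ** 2 / R + Q


def newton_step(p: float) -> float:
    """One Newton step on the Riccati residual."""
    slope = 2.0 * A - 2.0 * B**2 * p / R
    return p - riccati_residual(p) / slope


def pi_step(p: float) -> float:
    """One policy iteration step: improve the gain, then evaluate it exactly."""
    k = B * p / R                                    # improved policy u = -k x
    return (Q + R * k**2) / (2.0 * (B * k - A))      # closed-loop Lyapunov solution


def run_pi(p0: float, pev_error: float, iterations: int = 20) -> float:
    """Policy iteration where every evaluation is off by a fixed pev_error."""
    p = p0
    for _ in range(iterations):
        p = pi_step(p) + pev_error
    return p


exact = run_pi(0.0, pev_error=0.0)       # p0 = 0 gives the stabilizing gain k = 0
perturbed = run_pi(0.0, pev_error=1e-3)
print(abs(exact - P_STAR), abs(perturbed - P_STAR))
```

In this toy setting `pi_step` and `newton_step` return the same value for any admissible `p`, and the perturbed iteration settles at a distance from `P_STAR` that is on the order of `pev_error`, mirroring the statement that the PEV error enters as a bounded extra term in each Newton iteration.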
