Consistent inverse optimal control for discrete-time nonlinear stochastic systems

Ziliang Wang, Han Zhang, Axel Ringh

TL;DR

The paper tackles inverse optimal control for discrete-time nonlinear stochastic systems by reformulating the forward problem with occupancy measures into an infinite-dimensional linear program. It then derives a finite-dimensional, convex sum-of-squares estimator via polynomial approximation, proving asymptotic and statistical consistency as data and polynomial order grow. Numerical experiments on linear, nonlinear, and chaotic-like systems validate accuracy, robustness, and generalization, and show policy reconstruction benefits beyond behaviour cloning. The approach offers a scalable, theoretically grounded IOC framework capable of handling noise, nonlinearity, and long-horizon discounting in practice.

Abstract

Inverse Optimal Control (IOC) seeks to recover an unknown cost from expert demonstrations. It provides a systematic way of modeling experts' decision mechanisms while accounting for prior information about the cost functions. Nevertheless, the estimators of existing IOC methods suffer from consistency issues in noisy and nonlinear settings. In this paper, we consider a discrete-time nonlinear system with process noise, controlled by an optimal policy that minimizes the expectation of a discounted cumulative cost over an infinite time horizon. In particular, the cost function takes the form of a linear combination of a priori known feature functions. In this setting, we first adopt Lasserre's reformulation of the forward problem in terms of occupancy measures. Next, we propose an infinite-dimensional IOC algorithm and approximate it with Lagrange interpolating polynomials, which results in a convex, finite-dimensional sum-of-squares optimization problem. Moreover, the estimator is shown to be asymptotically and statistically consistent. Finally, we validate the theoretical results and illustrate the performance of our method with numerical experiments, which also demonstrate the robustness and generalizability of the proposed IOC algorithm.
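
For orientation, the setting described in the abstract can be sketched in symbols as follows. This is an illustrative reading, not the paper's own statement: the dynamics $f$, the noise $w_t$, the feature vector $\varphi$, the discount factor $\gamma$, and the policy $\pi$ are notation assumed here, and only $\ell$ and $\theta_\ell$ follow the symbols used in Proposition 3.2.

$$
x_{t+1} = f(x_t, u_t, w_t), \qquad \ell(x,u) = \theta_\ell^\top \varphi(x,u) = \sum_{i} \theta_{\ell,i}\, \varphi_i(x,u),
$$

$$
\min_{\pi}\; \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \ell(x_t, u_t)\right], \qquad u_t = \pi(x_t), \quad \gamma \in (0,1).
$$

Under this reading, the forward problem selects $\pi$ for a given $\theta_\ell$, while the IOC problem estimates $\theta_\ell$ from observed state and control data generated by the optimal policy.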

Paper Structure

This paper contains 15 sections, 7 theorems, 81 equations, 5 figures, 1 algorithm.

Key Result

Proposition 3.2

The problem eq:IOC_origin attains at least one optimal solution, and the optimal value is $0$. Moreover, for any optimal solution $(\theta_\ell^\star, V^\star, \psi^\star)$, $\bar{\pi}$ is an optimal control policy to the optimal control problem eq:forward_problem_obj_fun with the running cost $\ell

Figures (5)

  • Figure 1: Histogram of Estimation Errors for the LQR Case. Only the central 99.5% of the samples are displayed. The experiment uses redundant polynomial degrees ($d_\psi=3, d_V=2$) to assess the robustness of the proposed method under an overcomplete basis representation.
  • Figure 2: Estimation error histogram for the temperature control system under different approximation degrees. Only the central 99.5% of data points are plotted. Estimation accuracy improves steadily as the approximation degree increases from 2 to 4, with narrower error distributions for higher-order approximations.
  • Figure 3: Estimation accuracy versus data volume $M$. The full dataset contains 16384 one-step trajectories. For each value of $M$, the training set is obtained by randomly sampling $M$ trajectories from the full dataset. With approximately $100$ data samples, the estimation errors reach around $5\%$ for $d_\psi=3, d_V=2$ and around $2\%$ for $d_\psi=5, d_V=4$, after which they level off.
  • Figure 4: Estimation errors of the cost function coefficient vectors and SMAPE of the recovered control policy across varying approximation degrees. All experiments use around $131$k data samples. For each $d_\psi$, the best result across different $d_V$ values is reported.
  • Figure 5: The angle and control trajectories of the test inverted pendulum system. Trajectories are generated by three different controllers: (1) MPC controller using the ground truth coefficients (labeled "true"); (2) MPC controller using the estimated coefficients (labeled "est"); (3) Behavior cloning controller (three-layer neural network with 128 hidden units) trained on the same dataset (labeled "bc").

Theorems & Definitions (11)

  • Proposition 3.2
  • Proposition 3.3: Finite-time IOC algorithm
  • Remark 3.4
  • Remark 4.1
  • Theorem 4.2
  • Remark 4.3
  • Remark 4.4
  • Lemma 4.5
  • Lemma 4.6
  • Lemma 4.7
  • ...and 1 more