Adaptive Inverse Reinforcement Learning with Online Off-Policy Data Collection
Yibei Li, Yuexin Cao, Zhixin Liu, Lihua Xie
TL;DR
The paper addresses model-free reconstruction of an unknown control cost function from demonstrations, using online off-policy data that satisfies a persistent excitation condition. It introduces a direct, adaptive IRL approach that uses full NT-step primal-dual interior-point iterations to solve a time-varying SDP in the linear-quadratic case, and extends the approach to nonlinear problems via differential dynamic programming (DDP). Key contributions include convergence guarantees for the online algorithm under a finite signal-to-noise ratio (SNR), a gradient-based data-driven ILQR formulation, and a nonlinear IRL framework in which the gradient with respect to the cost parameter theta is computed end to end by backward recursions. The approach improves data efficiency by using off-policy data instead of requiring on-policy excitation, and in numerical simulations it performs competitively with model-based benchmarks. Overall, the work advances direct, adaptive IRL with provable online convergence for both linear and nonlinear optimal control problems.
Abstract
This paper addresses the inverse reinforcement learning (IRL) problem of reconstructing, in a model-free manner, the unknown cost function underlying an observed optimal policy. How to adapt the cost estimate online using completely off-policy system data remains unclear in the literature. Without prior knowledge of the system model parameters, an adaptive and direct learning rule for the cost parameter is proposed that uses online off-policy system data, which only needs to satisfy the mild persistent excitation condition common to the general data-driven paradigm. The adaptive online IRL algorithm is obtained by designing full Nesterov-Todd (NT)-step primal-dual interior-point iterations. Although the algorithm solves a nonlinear and time-varying semidefinite program (SDP), the influence of system noise is rigorously analyzed, and the proposed online algorithm is shown to achieve sublinear convergence. The proposed method is further generalized to nonlinear IRL based on differential dynamic programming. The gradient of the loss function is obtained directly via a backward pass, which eliminates the need to repeatedly solve forward RL problems as in conventional bi-level IRL frameworks. Finally, the efficiency and effectiveness of the proposed algorithms are demonstrated by numerical examples.
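The persistent excitation requirement on the off-policy data can be made concrete with a small check. The exact condition used in the paper is not stated in this summary, so the sketch below assumes the Hankel-rank definition common in data-driven control: an input sequence is persistently exciting of order L if its block-Hankel matrix with L block rows has full row rank. The helper names `hankel`, `matrix_rank`, and `is_persistently_exciting` are hypothetical and chosen only for illustration.

```python
def hankel(u, L):
    """Block-Hankel matrix with L block rows built from input vectors u[0..T-1]."""
    T, m = len(u), len(u[0])
    cols = T - L + 1
    if cols <= 0:
        raise ValueError("sequence too short for the requested order")
    rows = []
    for i in range(L):            # block row index
        for k in range(m):        # input channel within the block
            rows.append([u[i + j][k] for j in range(cols)])
    return rows


def matrix_rank(M, tol=1e-9):
    """Numerical rank via Gaussian elimination with partial pivoting."""
    A = [row[:] for row in M]
    n_rows, n_cols = len(A), len(A[0])
    r = 0
    for c in range(n_cols):
        if r == n_rows:
            break
        pivot = max(range(r, n_rows), key=lambda i: abs(A[i][c]))
        if abs(A[pivot][c]) < tol:
            continue              # no usable pivot in this column
        A[r], A[pivot] = A[pivot], A[r]
        for i in range(r + 1, n_rows):
            f = A[i][c] / A[r][c]
            for j in range(c, n_cols):
                A[i][j] -= f * A[r][j]
        r += 1
    return r


def is_persistently_exciting(u, L, tol=1e-9):
    """Assumed definition: the order-L block-Hankel matrix has full row rank."""
    H = hankel(u, L)
    return matrix_rank(H, tol) == len(H)


# A constant input carries no excitation beyond order 1.
constant_input = [[1.0]] * 10
# A varied binary input is rich enough for order 3 in this toy case.
varied_input = [[1.0], [0.0], [0.0], [1.0], [0.0], [1.0], [1.0], [0.0]]
```

In this toy setting, `is_persistently_exciting(constant_input, 2)` fails while `is_persistently_exciting(varied_input, 3)` passes, which matches the intuition that the off-policy input only has to be sufficiently rich and need not come from the policy being learned.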
