Understanding Inverse Reinforcement Learning under Overparameterization: Non-Asymptotic Analysis and Global Optimality

Ruijia Zhang, Siliang Zeng, Chenliang Li, Alfredo Garcia, Mingyi Hong

TL;DR

This work addresses inverse reinforcement learning (IRL) with neural network–parameterized rewards and proposes a two-timescale, single-loop ML-IRL framework that uses a dynamically truncated neural soft Q-learning inner loop. It provides non-asymptotic convergence guarantees and, under overparameterization, conditions for global optimality of the learned reward and the corresponding policy, with a finite-time complexity of $\mathcal{O}(\epsilon^{-8})$ steps to reach $\epsilon$-optimality. The analysis relies on local linearization of the neural networks for both the Q-function and the reward, together with ergodicity and regularity assumptions, and yields concrete rates for policy and gradient convergence. Empirically, the method demonstrates superior imitation performance on MuJoCo tasks with a single expert trajectory. Overall, the paper advances both the theoretical understanding and the practical application of neural network–parameterized IRL through a single-loop optimization framework with provable guarantees.

Abstract

The goal of the inverse reinforcement learning (IRL) task is to identify the underlying reward function and the corresponding optimal policy from a set of expert demonstrations. While the theoretical guarantees of most IRL algorithms rely on a linear reward structure, we aim to extend the theoretical understanding of IRL to scenarios where the reward function is parameterized by neural networks. Meanwhile, conventional IRL algorithms usually adopt a nested structure, leading to computational inefficiency, especially in high-dimensional settings. To address this problem, we propose the first two-timescale single-loop IRL algorithm with a neural network–parameterized reward and provide a non-asymptotic convergence analysis under overparameterization. Although prior optimality results for linear rewards do not apply, we show that our algorithm can identify the globally optimal reward and policy under certain neural network structures. This is the first IRL algorithm with a non-asymptotic convergence guarantee that provably achieves global optimality in neural network settings.
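The difference between a nested IRL scheme and a single-loop, two-timescale one can be illustrated on a toy problem. The following is a minimal sketch under strong simplifying assumptions, not the paper's algorithm: the reward is a table rather than a neural network, the fast step is one exact soft Bellman backup rather than a stochastic neural TD update, and the slow step uses the standard occupancy-matching gradient of maximum-likelihood IRL with exactly computed visitation measures. All function names (`soft_value`, `soft_policy`, `occupancy`, `single_loop_irl`) and the constant stepsize are hypothetical choices made for illustration.

```python
import numpy as np

def soft_value(Q):
    # Soft value V(s) = log sum_a exp Q(s, a), computed stably.
    m = Q.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(Q - m).sum(axis=1, keepdims=True))).ravel()

def soft_policy(Q):
    # Entropy-regularized policy pi(a|s) = exp(Q(s, a) - V(s)).
    return np.exp(Q - soft_value(Q)[:, None])

def occupancy(P, pi, rho0, gamma):
    # Normalized discounted state-action visitation under pi.
    # The state visitation d solves d = (1 - gamma) * rho0 + gamma * P_pi^T d.
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)
    d = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * rho0)
    return d[:, None] * pi

def single_loop_irl(P, rho0, mu_expert, gamma=0.9, K=2000, alpha=0.5):
    S, A, _ = P.shape
    r = np.zeros((S, A))  # tabular reward (assumption: stands in for a network)
    Q = np.zeros((S, A))
    for _ in range(K):
        # Fast timescale: a single soft Bellman backup, not a full RL solve.
        Q = r + gamma * (P @ soft_value(Q))
        pi = soft_policy(Q)
        # Slow timescale: one reward ascent step toward the expert visitation.
        r = r + alpha * (mu_expert - occupancy(P, pi, rho0, gamma))
    return r, soft_policy(Q)

# Usage on a small random MDP with a randomly drawn "expert" policy.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))    # shape (S, A, S)
rho0 = np.full(S, 1.0 / S)
pi_expert = rng.dirichlet(np.ones(A), size=S)  # shape (S, A)
mu_expert = occupancy(P, pi_expert, rho0, gamma)
r_hat, pi_hat = single_loop_irl(P, rho0, mu_expert, gamma=gamma, K=500)
```

The point of the sketch is structural: each iteration performs one cheap policy-side update and one reward update, so no reinforcement learning problem is solved to completion inside the loop.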

Paper Structure

This paper contains 36 sections, 19 theorems, 101 equations, 1 table, and 3 algorithms.

Key Result

Theorem 1

Suppose the assumptions on ergodicity, regularity of the policy, and regularity of the stationary distribution hold. Select stepsize $\alpha:=\frac{\alpha_0}{K^\sigma}$ for the reward update step and $\eta=\min \{K^{-\frac{1}{2}},(1-\gamma) / 8\}$ for the TD update step in Algorithm 1, where $\alpha_0>0$ and where the expectation is over all randomness and $\left\|\log \pi_{k+1}-\log \pi_{\theta_k}\right\|$ ...
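The two stepsize rules in the theorem are easy to evaluate for a given horizon. The sketch below simply computes $\alpha=\alpha_0/K^\sigma$ and $\eta=\min\{K^{-1/2},(1-\gamma)/8\}$; the function name `stepsizes` is hypothetical, and the admissible range of $\sigma$ is not given in the text above, so it is left as a free input here.

```python
def stepsizes(K, alpha0, sigma, gamma):
    """Return (alpha, eta) for a horizon of K iterations.

    alpha = alpha0 / K**sigma        (reward update step)
    eta   = min(K**-0.5, (1-gamma)/8) (TD update step)
    """
    if K <= 0 or alpha0 <= 0 or not (0 <= gamma < 1):
        raise ValueError("need K > 0, alpha0 > 0 and 0 <= gamma < 1")
    alpha = alpha0 / K ** sigma
    eta = min(K ** -0.5, (1 - gamma) / 8)
    return alpha, eta

# Example call; sigma = 0.5 is an arbitrary illustrative value.
alpha, eta = stepsizes(K=10_000, alpha0=1.0, sigma=0.5, gamma=0.99)
```

For discount factors close to 1, the cap $(1-\gamma)/8$ is the binding term in $\eta$ unless $K$ is very large, since $K^{-1/2}$ only falls below it once $K > 64/(1-\gamma)^2$.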

Theorems & Definitions (25)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Corollary 1
  • Definition A
  • Definition B
  • Lemma 1
  • Lemma 2
  • ...and 15 more