Randomized algorithms and PAC bounds for inverse reinforcement learning in continuous spaces

Angeliki Kamoutsi; Peter Schmitt-Förster; Tobias Sutter; Volkan Cevher; John Lygeros

TL;DR

The paper tackles inverse reinforcement learning (IRL) in Markov decision processes with continuous state and action spaces by formulating IRL as an infinite-dimensional linear program over occupancy measures, in which the discounted value is linear in the cost: $V_c^{\pi}(\nu_0)=\langle \mu_{\nu_0}^{\pi},c\rangle$. It characterizes the inverse feasibility set $\mathcal{C}(\pi_E)$, the set of costs for which the observed policy is optimal, through duality with the operator $T_\gamma^*$, and adds a linear normalization constraint to exclude trivial solutions. The authors then develop a finite-dimensional approximation using linear function bases and apply the scenario approach to obtain $\varepsilon$-optimal solutions with probabilistic guarantees, plus sample-based error bounds when only a finite set of demonstrations and a generative model are available. These results give formal guarantees for recovering cost functions in continuous IRL and are relevant to robotics and safety-critical decision-making, especially where forward solves are expensive or intractable.
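The identity $V_c^{\pi}(\nu_0)=\langle \mu_{\nu_0}^{\pi},c\rangle$ can be checked numerically on a small finite MDP. The following is a minimal sketch, not the authors' code: the finite state and action sets are a stand-in for the continuous spaces, the occupancy measure is assumed to be the unnormalized discounted one (total mass $1/(1-\gamma)$), and all names (`build_mdp`, `policy_value`, `occupancy_measure`) are hypothetical.

```python
# Sketch: check V = <mu, c> on a random finite MDP (stand-in for the continuous case).
import random

GAMMA = 0.9

def random_distribution(rng, size):
    """Return a random probability vector of the given size."""
    weights = [rng.random() + 0.1 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]

def build_mdp(rng, n_states=3, n_actions=2):
    """Random transition kernel P[x][a][y], cost c[x][a], policy pi[x][a], initial law nu0[x]."""
    P = [[random_distribution(rng, n_states) for _ in range(n_actions)]
         for _ in range(n_states)]
    c = [[rng.random() for _ in range(n_actions)] for _ in range(n_states)]
    pi = [random_distribution(rng, n_actions) for _ in range(n_states)]
    nu0 = random_distribution(rng, n_states)
    return P, c, pi, nu0

def policy_value(P, c, pi, gamma, iters=500):
    """Iterative policy evaluation of the discounted cost under pi."""
    n, m = len(P), len(P[0])
    V = [0.0] * n
    for _ in range(iters):
        V = [sum(pi[x][a] * (c[x][a] + gamma * sum(P[x][a][y] * V[y] for y in range(n)))
                 for a in range(m)) for x in range(n)]
    return V

def occupancy_measure(P, pi, nu0, gamma, horizon=500):
    """mu[x][a] = sum_t gamma^t * Prob(x_t = x, a_t = a), truncated at `horizon`."""
    n, m = len(P), len(P[0])
    mu = [[0.0] * m for _ in range(n)]
    state_law, weight = list(nu0), 1.0
    for _ in range(horizon):
        for x in range(n):
            for a in range(m):
                mu[x][a] += weight * state_law[x] * pi[x][a]
        # Push the state distribution one step forward under pi and P.
        state_law = [sum(state_law[x] * pi[x][a] * P[x][a][y]
                         for x in range(n) for a in range(m)) for y in range(n)]
        weight *= gamma
    return mu

rng = random.Random(0)
P, c, pi, nu0 = build_mdp(rng)
V = policy_value(P, c, pi, GAMMA)
mu = occupancy_measure(P, pi, nu0, GAMMA)
value = sum(nu0[x] * V[x] for x in range(len(P)))
pairing = sum(mu[x][a] * c[x][a] for x in range(len(P)) for a in range(len(P[0])))
mass = sum(sum(row) for row in mu)
print(abs(value - pairing), mass)
```

Because the value is linear in the cost for a fixed occupancy measure, optimality of a policy becomes a linear condition on the cost, which is what makes the linear programming view of IRL possible.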

Abstract

This work studies discrete-time discounted Markov decision processes with continuous state and action spaces and addresses the inverse problem of inferring a cost function from observed optimal behavior. We first consider the case in which we have access to the entire expert policy and characterize the set of solutions to the inverse problem by using occupation measures, linear duality, and complementary slackness conditions. To avoid trivial solutions and ill-posedness, we introduce a natural linear normalization constraint. This results in an infinite-dimensional linear feasibility problem, prompting a thorough analysis of its properties. Next, we use linear function approximators and adopt a randomized approach, namely the scenario approach and related probabilistic feasibility guarantees, to derive epsilon-optimal solutions for the inverse problem. We further discuss the sample complexity for a desired approximation accuracy. Finally, we deal with the more realistic case where we only have access to a finite set of expert demonstrations and a generative model and provide bounds on the error made when working with samples.
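The idea of testing optimality conditions only at randomly drawn state-action pairs can be illustrated on a toy problem. The sketch below is an illustration under strong simplifying assumptions, not the paper's algorithm: it uses a hand-built two-state, two-action MDP, fixes the cost instead of optimizing over cost and value-function coefficients, and takes the candidate dual variable `u` to be the expert's own value function. A cost is consistent with the expert when the Bellman residual $c(x,a)+\gamma\sum_y P(y\mid x,a)\,u(y)-u(x)$ is nonnegative everywhere and zero on the expert's actions. All function names are hypothetical.

```python
# Sketch: sampled check of the Bellman inequality for a fixed cost (toy finite MDP).
import random

GAMMA = 0.9
# Action 0 stays in the current state, action 1 moves to the other state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
c = [[1.0, 0.5],
     [0.0, 2.0]]

def q_value(P, c, V, gamma, x, a):
    """One-step lookahead cost of taking action a in state x, then following V."""
    return c[x][a] + gamma * sum(P[x][a][y] * V[y] for y in range(len(P)))

def policy_value(P, c, policy, gamma, iters=500):
    """Discounted cost of a deterministic policy, by fixed-point iteration."""
    V = [0.0] * len(P)
    for _ in range(iters):
        V = [q_value(P, c, V, gamma, x, policy[x]) for x in range(len(P))]
    return V

def greedy_policy(P, c, gamma, iters=500):
    """Value iteration for cost minimization, followed by a greedy action choice."""
    n, m = len(P), len(P[0])
    V = [0.0] * n
    for _ in range(iters):
        V = [min(q_value(P, c, V, gamma, x, a) for a in range(m)) for x in range(n)]
    return [min(range(m), key=lambda a: q_value(P, c, V, gamma, x, a)) for x in range(n)]

def bellman_residual(P, c, u, gamma, x, a):
    """Residual c(x,a) + gamma * E[u(y)] - u(x); negative values violate feasibility."""
    return q_value(P, c, u, gamma, x, a) - u[x]

def sampled_violation(P, c, u, gamma, n_samples, rng):
    """Largest violation of the Bellman inequality over uniformly sampled (x, a) pairs."""
    worst = 0.0
    for _ in range(n_samples):
        x, a = rng.randrange(len(P)), rng.randrange(len(P[0]))
        worst = min(worst, bellman_residual(P, c, u, gamma, x, a))
    return -worst

expert = greedy_policy(P, c, GAMMA)            # optimal for this cost
u_expert = policy_value(P, c, expert, GAMMA)
u_bad = policy_value(P, c, [0, 0], GAMMA)      # "always stay" is suboptimal here
print(expert, sampled_violation(P, c, u_expert, GAMMA, 50, random.Random(0)))
print(bellman_residual(P, c, u_bad, GAMMA, 0, 1))
```

For the optimal expert no sampled pair can violate the inequality, whereas for the suboptimal policy the pair (state 0, action 1) does. A sampled check can miss such a pair, which is why the scenario approach attaches a probabilistic guarantee to the number of samples.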


Paper Structure (20 sections, 12 theorems, 96 equations, 4 figures)

Introduction
Contributions.
Related literature.
Basic definitions and notations.
Markov decision processes and linear programming formulation
Continuous Markov decision process.
Occupancy measures.
The linear programming approach.
Inverse reinforcement learning and characterization of solutions
Towards recovering a nearly optimal cost function
Normalization constraint
The case of known dynamics and expert policy
Sample-based inverse reinforcement learning
Sampling process.
Numerical Results
...and 5 more sections

Key Result

Theorem 1

Let $\pi_{\textup{E}}\in\Pi$. Under Assumption \ref{assumption1} on the Markov decision model $\mathcal{M}_c$, the following assertions are equivalent: ... As a consequence, $\mathcal{C}(\pi_{\textup{E}})=\mathcal{C}^{0}(\pi_{\textup{E}})=\bigcap_{\varepsilon>0}\mathcal{C}^{\varepsilon}(\pi_{\textup{E}})$. Moreover, $\mathcal{C}(\pi_{\textup{E}})$ is a convex cone and ...

Figures (4)

Figure 1: Illustration of Theorem \ref{IP} for $\varepsilon_1>\varepsilon_2$.
Figure 2: Main building blocks of our methodology
Figure 3: Solutions of the Sampled Inverse Program \ref{SIP}. The variable $N$ is the number of i.i.d. samples $(x, a)$ drawn uniformly from $\mathcal{X}\times\mathcal{A}$. We run $1000$ independent experiments. Plot (a) shows the empirical probability that the estimated cost function $\tilde{c}_N$ is an element of the feasibility set, as described in Theorem \ref{scenario}, for given values of $N$ and $\epsilon$. Plot (b) shows the objective value $\tilde{\varepsilon}_{N}$ of the random program \ref{SIP}, averaged over the $1000$ experiments; the shaded area shows the standard deviation. Plot (c) visualizes the theoretical sample complexity given by Theorem \ref{scenario}: for various values of $\delta$ and $\epsilon$, we plot the sample size $N=\textup{N}(n_{c} + n_{u} + 1, g(\frac{\epsilon}{L_{\Lambda}}), \delta)$. The variation parameter is set to $\Delta=10^{-7}$. Plot (d) compares the discounted long-run costs $V_{\bar{\tilde{c}}_{N}}^\pi(\nu_0)$ for the average $\bar{\tilde{c}}_{N}$ of the learnt costs $\tilde{c}_N$ under the expert policy $\pi_{\textup{E}}$ (red) and the optimal policy (blue). The solid lines show the average over $1000$ independent experiments; the shaded area shows the standard deviation.
Figure 4: Solutions of the Sampled Inverse Program \ref{SIP2}. The variable $N$ is the number of i.i.d. samples $(x, a)$ drawn uniformly from $\mathcal{X}\times\mathcal{A}$. We run $1000$ independent experiments. Plot (a) shows the empirical probability that the estimated cost function $\tilde{c}_{N,m,n,k}$ is an element of the feasibility set, as described in Theorem \ref{lastone}, for different pairs $(N,k)$ and a chosen accuracy parameter $\epsilon$. Plot (b) shows the theoretical lower bound on $k$ as a function of $N$ for a fixed $\epsilon$, as described by Theorem \ref{lastone}.
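
The sample size $\textup{N}(\cdot,\cdot,\delta)$ plotted in Figure 3(c) is not defined on this page. A minimal sketch follows, assuming it denotes the standard scenario-approach sample size: the smallest $N$ such that $\sum_{i=0}^{d-1}\binom{N}{i}\epsilon^{i}(1-\epsilon)^{N-i}\le\delta$, where $d$ is the number of decision variables. The function names and example values are hypothetical, and the paper's tightening function $g$ and constant $L_{\Lambda}$ are not reproduced; `eps` below stands for the already transformed violation level.

```python
# Sketch: standard scenario-approach sample size (assumed form of N(d, eps, delta)).
import math

def binomial_tail(n_samples, dim, eps):
    """P(Bin(n_samples, eps) <= dim - 1), the confidence term of the scenario bound."""
    return sum(math.comb(n_samples, i) * eps**i * (1.0 - eps)**(n_samples - i)
               for i in range(dim))

def scenario_sample_size(dim, eps, delta):
    """Smallest N >= dim whose binomial tail is at most delta (linear search)."""
    n_samples = dim
    while binomial_tail(n_samples, dim, eps) > delta:
        n_samples += 1
    return n_samples

def explicit_bound(dim, eps, delta):
    """Well-known sufficient sample size (2 / eps) * (ln(1 / delta) + dim), rounded up."""
    return math.ceil(2.0 / eps * (math.log(1.0 / delta) + dim))

# Example with arbitrary values: dim would play the role of n_c + n_u + 1.
for dim in (1, 5, 10):
    print(dim, scenario_sample_size(dim, 0.1, 0.01), explicit_bound(dim, 0.1, 0.01))
```

The implicit binomial condition gives a smaller sample size than the explicit logarithmic bound, at the price of a numerical search.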

Theorems & Definitions (26)

Definition 1: IRL [Ng:2000, Metelli:2021]
Definition 2
Theorem 1: Inverse feasibility set characterization
proof
Proposition 1: $\varepsilon$-inverse feasibility set characterization
proof
Proposition 2
proof: Proof of Proposition \ref{prop:e0}
Proposition 3
proof
...and 16 more
