Apprenticeship learning with prior beliefs using inverse optimization

Mauricio Junca, Esteban Leiva

TL;DR

The paper tackles the problem of learning cost functions and apprentice policies in MDPs when the expert is suboptimal and prior beliefs about the cost structure exist. It unifies IRL, IO, and AL by casting the apprenticeship-learning objective as a regularized convex-concave min-max problem solvable by stochastic mirror descent, and shows that AL with zero regularization emerges as a special case. A key contribution is incorporating a prior cost $\hat{\mathbf{c}}$ through a projection onto the inverse-feasible set and solving a regularized problem that remains convex, with theoretical convergence guarantees. The approach yields interpretable cost vectors and apprentice policies that balance fidelity to the prior with competitive performance relative to the expert, validated on gridworld experiments that reveal the importance of regularization. Overall, the work clarifies the IO-IRL-AL relationship, provides a practical SMD-based solver for regularized cost learning, and demonstrates improved learning of cost structures and apprentice behaviors when priors are available.

Abstract

The relationship between inverse reinforcement learning (IRL) and inverse optimization (IO) for Markov decision processes (MDPs) has been relatively underexplored in the literature, even though both address the same problem. In this work, we revisit the relationship between the IO framework for MDPs, IRL, and apprenticeship learning (AL). We incorporate prior beliefs on the structure of the cost function into the IRL and AL problems, and demonstrate that the convex-analytic view of the AL formalism (Kamoutsi et al., 2021) emerges as a relaxation of our framework. Notably, the AL formalism is a special case in our framework when the regularization term is absent. Focusing on the suboptimal expert setting, we formulate the AL problem as a regularized min-max problem. The regularizer plays a key role in addressing the ill-posedness of IRL by guiding the search for plausible cost functions. To solve the resulting regularized convex-concave min-max problem, we use stochastic mirror descent (SMD) and establish convergence bounds for the proposed method. Numerical experiments highlight the critical role of regularization in learning cost vectors and apprentice policies.
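
To make the idea of solving a regularized convex-concave min-max problem with stochastic mirror descent concrete, here is a minimal sketch on a toy problem. Everything in it is an assumption chosen for illustration and not the paper's formulation or algorithm: the toy objective $x^\top A y + \frac{\lambda}{2}\|x - \hat{x}\|^2$, the simplex constraints on both players, the entropic mirror map, the one-sample gradient estimator, and the names `softmax` and `smd_saddle`.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax; returns a point on the probability simplex."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def smd_saddle(A, x_hat, lam, steps=20000, eta=0.02, seed=0):
    """Entropic stochastic mirror descent on the toy saddle-point problem

        min_{x in simplex} max_{y in simplex}  x^T A y + (lam / 2) * ||x - x_hat||^2

    (a hypothetical stand-in, not the paper's objective).
    Returns the averaged iterates (x_avg, y_avg).
    """
    rng = random.Random(seed)
    n, m = len(A), len(A[0])
    gx_sum = [0.0] * n  # accumulated stochastic gradients, min player
    gy_sum = [0.0] * m  # accumulated stochastic gradients, max player
    x_avg = [0.0] * n
    y_avg = [0.0] * m
    for _ in range(steps):
        # With the entropic mirror map, a uniform start, and a constant step
        # size, the mirror step has a closed form: exponential weights on the
        # accumulated gradients (descent in x, ascent in y).
        x = softmax([-eta * g for g in gx_sum])
        y = softmax([eta * g for g in gy_sum])
        # Unbiased one-sample estimates of A y and A^T x.
        j = rng.choices(range(m), weights=y)[0]
        i = rng.choices(range(n), weights=x)[0]
        for k in range(n):
            # Sampled column of A plus the exact gradient of the regularizer.
            gx_sum[k] += A[k][j] + lam * (x[k] - x_hat[k])
        for k in range(m):
            gy_sum[k] += A[i][k]
        # Running average of the iterates.
        x_avg = [a + b / steps for a, b in zip(x_avg, x)]
        y_avg = [a + b / steps for a, b in zip(y_avg, y)]
    return x_avg, y_avg
```

In this toy, setting `lam=0.0` recovers a plain matrix game, while a positive `lam` pulls the minimizing player toward the reference point `x_hat`, loosely mirroring how a regularizer can steer the solution toward a prior.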

Paper Structure

This paper contains 22 sections, 12 theorems, 59 equations, 7 figures, 1 algorithm.

Key Result

Proposition 1

It holds that $\mathcal{F} = \{\bm{\mu}_{\pi} \mid \pi \in \Pi_0\}$. For every $\pi \in \Pi_0$, we have that $\bm{\mu}_{\pi} \in \mathcal{F}$. Moreover, for every feasible solution $\bm{\mu} \in \mathcal{F}$, we can obtain a stationary Markov policy $\pi_{\bm{\mu}} \in \Pi_0$ by ...
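
The correspondence between feasible points $\bm{\mu}$ and stationary Markov policies can be illustrated with a short sketch. The statement above ends before its formula, so the code below assumes the textbook construction of normalizing the occupancy measure over actions in each state; this is an assumption and may differ from the paper's exact statement, in particular for states with zero occupancy, where the uniform fallback used here is an arbitrary choice. The function name `policy_from_occupancy` and the nested-list layout `mu[s][a]` are hypothetical.

```python
def policy_from_occupancy(mu):
    """Turn a state-action occupancy table mu[s][a] >= 0 into a policy pi[s][a].

    Assumed construction: pi(a | s) = mu(s, a) / sum_a' mu(s, a').
    States with zero total occupancy get a uniform distribution (arbitrary).
    """
    policy = []
    for row in mu:
        mass = sum(row)
        if mass > 0:
            policy.append([value / mass for value in row])
        else:
            policy.append([1.0 / len(row)] * len(row))
    return policy
```

For example, a state with occupancies `[0.3, 0.1]` yields action probabilities close to `[0.75, 0.25]`, since each entry is divided by the state's total mass of `0.4`.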

Figures (7)

  • Figure 1: Illustration of the incorporation of $\hat{{\bm{c}}}$.
  • Figure 2: Illustration of problem (IO-AL).
  • Figure 3: Illustration of the Gridworld environment, the optimal policy, and the expert's policy.
  • Figure 4: Effect of the regularization on the cost vector.
  • Figure 5: Effect of regularization on the apprentice policy.
  • ...and 2 more figures

Theorems & Definitions (22)

  • Proposition 1: Puterman (1994)
  • Proposition 2: Complementary slackness
  • Theorem 1: cf. Proposition 2 in Kamoutsi et al. (2021)
  • Corollary 1: Optimal expert
  • Proposition 3: Suboptimal expert
  • proof
  • Definition 1: $\epsilon$-approximate solution
  • Definition 2: Bounded estimator
  • Lemma 1
  • Lemma 2
  • ...and 12 more