Inverse Reinforcement Learning With Constraint Recovery

Nirjhar Das, Arpan Chattopadhyay

TL;DR

This work addresses inverse reinforcement learning for constrained MDPs (CMDPs) with the goal of recovering both the reward $r(s)=\mathbf{w}_r^\top \Phi_r(s)$ and the constraint $c(s)=\mathbf{w}_c^\top \Phi_c(s)$ from demonstrations. It builds on the maximum entropy IRL framework, deriving a Boltzmann trajectory distribution $p^*(\tau|\mathbf{w}_r,\mathbf{w}_c) \propto \exp(\mathbf{w}_r^\top \Phi_r(\tau) - \lambda \mathbf{w}_c^\top \Phi_c(\tau))$ with partition function $Z(\mathbf{w}_r,\mathbf{w}_c)$, and casts the problem as a non-convex constrained optimization in $(\mathbf{w}_r, \mathbf{w}_c)$. An alternating optimization scheme is proposed, where each subproblem is convex and gradients are given by $\nabla_{\mathbf{w}_r}\mathcal{L} = -\tilde{\Phi}_r + \hat{\Phi}_r$ and $\nabla_{\mathbf{w}_c}\mathcal{L} = \lambda(\tilde{\Phi}_c - \hat{\Phi}_c)$, solved via Exponentiated Gradient Descent with a 1-norm simplex projection. Empirical results in a stochastic grid world show successful recovery of both the reward and the constraint, and recovered policies align with the true optimal policies, illustrating the method’s viability for safety-critical CMDP settings.
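The two gradient expressions above follow from the Boltzmann form in a few lines. The sketch below rests on assumptions not stated in this summary: $\mathcal{L}$ is the average negative log-likelihood of $N$ demonstrated trajectories $\tau_1,\dots,\tau_N$, $\tilde{\Phi}$ denotes empirical feature averages over the demonstrations, $\hat{\Phi}$ denotes feature expectations under $p^*$, and $\lambda$ is held fixed.

$$
\begin{aligned}
\mathcal{L}(\mathbf{w}_r,\mathbf{w}_c)
  &= -\frac{1}{N}\sum_{i=1}^{N}\log p^*(\tau_i|\mathbf{w}_r,\mathbf{w}_c)
   = -\mathbf{w}_r^\top \tilde{\Phi}_r + \lambda\,\mathbf{w}_c^\top \tilde{\Phi}_c + \log Z(\mathbf{w}_r,\mathbf{w}_c), \\
\nabla_{\mathbf{w}_r}\log Z &= \mathbb{E}_{p^*}\!\left[\Phi_r(\tau)\right] = \hat{\Phi}_r,
\qquad
\nabla_{\mathbf{w}_c}\log Z = -\lambda\,\mathbb{E}_{p^*}\!\left[\Phi_c(\tau)\right] = -\lambda\,\hat{\Phi}_c, \\
\nabla_{\mathbf{w}_r}\mathcal{L} &= -\tilde{\Phi}_r + \hat{\Phi}_r,
\qquad
\nabla_{\mathbf{w}_c}\mathcal{L} = \lambda\left(\tilde{\Phi}_c - \hat{\Phi}_c\right).
\end{aligned}
$$

With one block of weights held fixed, $\mathcal{L}$ is a linear term plus a log-sum-exp of a linear function of the other block, which is convex. This is consistent with the claim that each subproblem of the alternating scheme is convex.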

Abstract

In this work, we propose a novel inverse reinforcement learning (IRL) algorithm for constrained Markov decision process (CMDP) problems. In standard IRL problems, the inverse learner or agent seeks to recover the reward function of the MDP, given a set of trajectory demonstrations for the optimal policy. Here, we seek to infer not only the reward functions of the CMDP, but also the constraints. Using the principle of maximum entropy, we show that the IRL with constraint recovery (IRL-CR) problem can be cast as a constrained non-convex optimization problem. We reduce it to an alternating constrained optimization problem whose sub-problems are convex, and we solve it using the exponentiated gradient descent algorithm. Finally, we demonstrate the efficacy of our algorithm in a grid world environment.
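To make the alternating scheme concrete, here is a minimal Python sketch of exponentiated gradient descent with 1-norm normalisation. It is not the authors' implementation. It assumes a small, fully enumerated trajectory set so that feature expectations under the Boltzmann distribution can be computed exactly, and it treats $\lambda$ as a fixed constant. All names (`egd_step`, `expected_features`, `alternating_egd`, `F_r`, `F_c`) are hypothetical, and the paper's additional constraints on the optimization are omitted.

```python
import numpy as np


def egd_step(w, grad, lr):
    """One exponentiated-gradient step, then rescale so the weights sum to 1."""
    w_new = w * np.exp(-lr * grad)
    return w_new / np.sum(w_new)


def expected_features(w_r, w_c, F_r, F_c, lam):
    """Exact feature expectations under p(tau) ∝ exp(w_r·Φ_r(tau) − lam·w_c·Φ_c(tau)).

    F_r and F_c hold one row of trajectory features per enumerated trajectory
    (an assumption of this sketch; large problems would need approximation).
    """
    logits = F_r @ w_r - lam * (F_c @ w_c)
    p = np.exp(logits - logits.max())  # subtract the max for numerical stability
    p /= p.sum()
    return p @ F_r, p @ F_c


def alternating_egd(F_r, F_c, phi_r_demo, phi_c_demo,
                    lam=1.0, lr=0.1, outer=50, inner=20):
    """Alternate EGD updates on w_r (w_c fixed) and on w_c (w_r fixed)."""
    w_r = np.full(F_r.shape[1], 1.0 / F_r.shape[1])
    w_c = np.full(F_c.shape[1], 1.0 / F_c.shape[1])
    for _ in range(outer):
        for _ in range(inner):
            phi_r_hat, _ = expected_features(w_r, w_c, F_r, F_c, lam)
            # Gradient w.r.t. w_r: -(empirical features) + (expected features)
            w_r = egd_step(w_r, -phi_r_demo + phi_r_hat, lr)
        for _ in range(inner):
            _, phi_c_hat = expected_features(w_r, w_c, F_r, F_c, lam)
            # Gradient w.r.t. w_c: lam * (empirical - expected)
            w_c = egd_step(w_c, lam * (phi_c_demo - phi_c_hat), lr)
    return w_r, w_c
```

Because the update is multiplicative and followed by a rescaling, weights that start positive stay positive and sum to one, so no separate projection step is needed in this sketch.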

Paper Structure (7 sections, 13 equations, 1 figure, 1 algorithm)

Figures (1)

  • Figure 1: Pictorial demonstration of the performance of our algorithm in a grid-world setting. The numeric values of rewards and costs are written inside the boxes. The arrows denote the optimal action in a state.