Imitating Cost-Constrained Behaviors in Reinforcement Learning

Qian Shao; Pradeep Varakantham; Shih-Fen Cheng

TL;DR

This paper tackles imitation learning when expert behavior is governed by both rewards and cost constraints within a constrained Markov decision process. It introduces three scalable methods: a Lagrangian-based CCIL with a three-way gradient update, a meta-gradient approach (MALM) that tunes Lagrangian penalties online, and a cost-violation-based alternating gradient (CVAG) that adaptively prioritizes reward or cost minimization based on feasibility. The methods are evaluated on Safety Gym and MuJoCo environments, showing that unconstrained imitation learners struggle to respect costs while the proposed approaches achieve a favorable balance between imitation accuracy and constraint satisfaction, with MALM often performing best overall. The work advances practical imitation learning for cost-sensitive domains and provides a foundation for principled optimization of constraint handling in IRL/GAIL-like frameworks.
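The feasibility-based switching idea behind CVAG can be illustrated with a minimal sketch. It assumes the simplest possible rule: when the estimated expected cost exceeds the cost limit, take a step that reduces cost; otherwise take a step that increases the imitation reward. The function names, the plain-list parameter vector, and the learning rate are hypothetical and chosen for illustration; the paper's actual update may differ.

```python
def cvag_direction(reward_grad, cost_grad, expected_cost, cost_limit):
    """Choose the update direction for one policy step.

    Assumption: a hard switch on feasibility. If the estimated cost
    violates the limit, descend on cost; otherwise ascend on reward.
    """
    if expected_cost > cost_limit:
        return [-g for g in cost_grad]   # infeasible: reduce cost
    return list(reward_grad)             # feasible: improve reward


def cvag_step(params, reward_grad, cost_grad, expected_cost, cost_limit, lr=0.5):
    """Apply one alternating-gradient step to a flat parameter list."""
    direction = cvag_direction(reward_grad, cost_grad, expected_cost, cost_limit)
    return [p + lr * d for p, d in zip(params, direction)]


# Hypothetical numbers: the first call is infeasible, the second is feasible.
infeasible = cvag_step([0.0, 0.0], [1.0, 2.0], [3.0, 4.0], expected_cost=5.0, cost_limit=2.0)
feasible = cvag_step([0.0, 0.0], [1.0, 2.0], [3.0, 4.0], expected_cost=1.0, cost_limit=2.0)
```

In this sketch only one of the two gradients is used per step, which is what distinguishes an alternating scheme from a weighted Lagrangian combination.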

Abstract

Complex planning and scheduling problems have long been solved using various optimization or heuristic approaches. In recent years, imitation learning, which aims to learn from expert demonstrations, has been proposed as a viable alternative for solving these problems. Generally speaking, imitation learning is designed to learn either the reward (or preference) model or the behavioral policy directly by observing the behavior of an expert. Existing work in imitation learning and inverse reinforcement learning has focused on imitation primarily in unconstrained settings (e.g., no limit on fuel consumed by the vehicle). However, in many real-world domains, the behavior of an expert is governed not only by reward (or preference) but also by constraints. For instance, decisions for self-driving delivery vehicles depend not only on the route preferences/rewards (which depend on past demand data) but also on the fuel in the vehicle and the time available. In such problems, imitation learning is challenging because decisions are dictated not only by the reward model but also by a cost-constrained model. In this paper, we provide multiple methods that match expert distributions in the presence of trajectory cost constraints through (a) a Lagrangian-based method; (b) meta-gradients that find a good trade-off between maximizing expected return and minimizing constraint violation; and (c) a cost-violation-based alternating gradient. We empirically show that leading imitation learning approaches imitate cost-constrained behaviors poorly and that our meta-gradient-based approach achieves the best performance.
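To make the Lagrangian idea concrete, the sketch below shows a standard projected dual-ascent update for a single cost constraint. It is a generic textbook scheme, not necessarily the update used in the paper: the multiplier grows when the expected cost exceeds the limit and shrinks (but never below zero) when the policy is feasible. The names `update_multiplier`, `penalized_return`, and `cost_limit` are hypothetical.

```python
def update_multiplier(lam, expected_cost, cost_limit, lr=0.5):
    """Projected dual ascent: raise lam on violation, keep lam >= 0."""
    return max(0.0, lam + lr * (expected_cost - cost_limit))


def penalized_return(expected_return, expected_cost, cost_limit, lam):
    """Lagrangian-relaxed objective that the policy would maximize."""
    return expected_return - lam * (expected_cost - cost_limit)


# Hypothetical estimates from one batch of rollouts.
lam_after_violation = update_multiplier(1.0, expected_cost=30.0, cost_limit=25.0)
lam_after_slack = update_multiplier(0.1, expected_cost=10.0, cost_limit=25.0)
objective = penalized_return(100.0, expected_cost=30.0, cost_limit=25.0, lam=2.0)
```

The step size `lr` controls how aggressively the penalty reacts to violations. Choosing it by hand is the kind of tuning that the meta-gradient variant described above aims to automate.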

Paper Structure (22 sections, 4 theorems, 40 equations, 3 figures, 4 tables, 3 algorithms)

Introduction
Background and Related Work
Constrained Markov Decision Process
Imitation Learning
Lagrangian Method
Meta-Gradient for Lagrangian Approach
Cost-Violation-based Alternating Gradient
Experiments
Setup
Results
Conclusion
Acknowledgments
A. Theoretical Analysis
Step 1
Step 2
...and 7 more sections

Key Result

Theorem 1

The objective function of the cost-constrained imitation learning problem is: where $\psi^*(\rho_\pi - \rho_{\pi_E}) = \max_{D,\lambda} \mathbb{E}_\pi[\log(D(s,a))] + \mathbb{E}_{\pi_E}[\log(1-D(s,a))] + \lambda (\mathbb{E}_\pi[d(s,a)] - \mathbb{E}_{\pi_E}[d(s,a)])$
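As an illustration of the expression inside the maximization, the following sketch evaluates it for a fixed discriminator and a fixed multiplier using sample averages. It does not perform the maximization over $D$ and $\lambda$. The inputs are assumed to be discriminator outputs $D(s,a) \in (0,1)$ and per-step costs $d(s,a)$ collected from policy and expert samples; the function name and argument layout are hypothetical.

```python
import math


def inner_objective(d_policy, d_expert, cost_policy, cost_expert, lam):
    """Monte Carlo estimate of
    E_pi[log D] + E_piE[log(1 - D)] + lam * (E_pi[d] - E_piE[d])
    for fixed discriminator outputs and a fixed multiplier lam.
    """
    def mean(xs):
        return sum(xs) / len(xs)

    policy_term = mean([math.log(p) for p in d_policy])        # E_pi[log D(s,a)]
    expert_term = mean([math.log(1.0 - p) for p in d_expert])  # E_piE[log(1 - D(s,a))]
    cost_gap = mean(cost_policy) - mean(cost_expert)           # E_pi[d] - E_piE[d]
    return policy_term + expert_term + lam * cost_gap


# Hypothetical samples: an uninformative discriminator and a policy
# whose average cost exceeds the expert's by 1.0.
value = inner_objective(
    d_policy=[0.5, 0.5],
    d_expert=[0.5, 0.5],
    cost_policy=[1.0, 3.0],
    cost_expert=[1.0, 1.0],
    lam=2.0,
)
```

With these inputs the two discriminator terms each contribute $\log 0.5$, and the cost term contributes $\lambda$ times the cost gap, so a policy that incurs more cost than the expert is penalized in proportion to $\lambda$.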

Figures (3)

Figure 1: Performance of Safety Gym environments. The x-axes indicate the number of iterations, and the y-axes indicate the performance of the agent, including average rewards/costs/cost rates with standard deviations.
Figure A.1: Performance of MuJoCo environments. The x-axes indicate the number of iterations, and the y-axes indicate the performance of the agent, including average rewards/costs/cost rates with standard deviations.
Figure A.2: Performance of Humanoid and DoggoButton tasks. The x-axes indicate the number of iterations, and the y-axes indicate the performance of the agent, including average rewards/costs/cost rates with standard deviations.

Theorems & Definitions (4)

Theorem 1
Proposition 1
Proposition 2
Proposition 3
