Balance Equation-based Distributionally Robust Offline Imitation Learning
Rishabh Agrawal, Yusuf Alvi, Rahul Jain, Ashutosh Nayyar
TL;DR
This work addresses robustness in offline imitation learning when deployment dynamics diverge from training dynamics. It introduces BE-DROIL (Balance Equation-based Distributionally Robust Offline Imitation Learning), a framework that learns purely offline from expert demonstrations collected under nominal dynamics and constrains admissible transition perturbations with an $f$-divergence-based occupancy ball. A triplet occupancy representation combined with strong duality yields a closed-form, data-driven importance-weighting objective that does not depend explicitly on the unknown dynamics, which keeps offline optimization scalable. Empirically, BE-DROIL is more robust than state-of-the-art offline IL baselines across multiple MuJoCo benchmarks under diverse perturbations, using a single fixed configuration for all domains. The approach is a step toward safer, more generalizable imitation learning for real-world robotics, with two caveats: the expert data may be biased, and deployment still calls for careful safety and drift monitoring.
Abstract
Imitation Learning (IL) has proven highly effective for robotic and control tasks where manually designing reward functions or explicit controllers is infeasible. However, standard IL methods implicitly assume that the environment dynamics remain fixed between training and deployment. In practice, this assumption rarely holds: modeling inaccuracies, real-world parameter variations, and adversarial perturbations can all shift the transition dynamics and severely degrade performance. We address this challenge through Balance Equation-based Distributionally Robust Offline Imitation Learning, a framework that learns robust policies solely from expert demonstrations collected under nominal dynamics, without requiring further environment interaction. We formulate the problem as a distributionally robust optimization over an uncertainty set of transition models, seeking a policy that minimizes the imitation loss under the worst-case transition distribution. Importantly, we show that this robust objective can be reformulated entirely in terms of the nominal data distribution, enabling tractable offline learning. Empirical evaluations on continuous-control benchmarks demonstrate that our approach achieves superior robustness and generalization compared to state-of-the-art offline IL baselines, particularly under perturbed or shifted environments.
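To make the idea of "reformulating a worst-case objective in terms of the nominal data distribution" concrete, the following is a minimal, generic sketch and not the BE-DROIL objective itself. It assumes a KL-divergence ball (one member of the $f$-divergence family) around the empirical distribution of some per-sample losses, and uses the standard dual $\inf_{\lambda > 0} \lambda \rho + \lambda \log \mathbb{E}_{P}[\exp(\ell / \lambda)]$ to evaluate the worst-case mean loss from nominal samples only. The function name `kl_dro_worst_case`, the radius `rho`, the toy `losses`, and the grid search over $\lambda$ are all illustrative assumptions; the paper's formulation works with a triplet occupancy ball and an imitation loss that this sketch does not reproduce.

```python
import numpy as np


def kl_dro_worst_case(losses, rho, lambdas=None):
    """Worst-case mean loss over {Q : KL(Q || P_n) <= rho}, where P_n is the
    empirical (nominal) distribution of `losses`.

    Evaluates the dual  inf_{lam > 0} lam * rho + lam * log mean(exp(losses / lam))
    on a grid of lam values (an illustrative choice, not an exact solver).
    Returns the approximate worst-case value and the importance weights
    q_i proportional to exp(losses_i / lam) at the selected lam.
    """
    losses = np.asarray(losses, dtype=float)
    if lambdas is None:
        lambdas = np.logspace(-3, 3, 2001)
    lambdas = np.asarray(lambdas, dtype=float)

    m = losses.max()
    # Shift by the max loss so exp() never overflows (log-mean-exp trick):
    # lam * log mean(exp(l / lam)) = lam * log mean(exp((l - m) / lam)) + m
    z = (losses[None, :] - m) / lambdas[:, None]
    dual = lambdas * rho + lambdas * np.log(np.mean(np.exp(z), axis=1)) + m

    k = int(np.argmin(dual))
    # Exponential tilting of the nominal samples: larger loss, larger weight.
    weights = np.exp((losses - m) / lambdas[k])
    weights /= weights.sum()
    return float(dual[k]), weights


# Toy nominal losses (hypothetical numbers, for illustration only).
losses = np.array([0.1, 0.2, 0.4, 0.8, 1.6])
worst, weights = kl_dro_worst_case(losses, rho=0.1)
print("nominal mean:", losses.mean())
print("worst-case mean (KL radius 0.1):", worst)
print("importance weights:", weights)
```

The point of the sketch is that the adversary never has to be simulated: the worst case is obtained by reweighting nominal samples, and the reweighted mean `weights @ losses` approximately matches the dual value. With `rho=0` the result collapses to the nominal mean, and it grows toward the maximum loss as `rho` increases.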
