Offline Reinforcement Learning via Inverse Optimization
Ioannis Dimanidis, Tolga Ok, Peyman Mohajerin Esfahani
TL;DR
The paper tackles offline RL under distribution shift in continuous control by coupling a non-causal MPC action-improvement step with inverse-optimization (IO) policy distillation based on a convex sub-optimality loss. It introduces a robust, disturbance-aware MPC (RMPC) with an exact convex reformulation to handle disturbances arising from model mismatch, and shows that the IO hypothesis class achieves competitive performance in low-data regimes with about three orders of magnitude fewer parameters than state-of-the-art baselines. Empirical results on quadrotor control and MuJoCo benchmarks show strong data efficiency and robustness, and the code is released as an open-source package to support reproducibility. The work thus bridges robust control and offline RL, offering a practical convex framework for continuous-control tasks under distribution shift and opening avenues for extending in-hindsight IO and RMPC in offline settings.
Abstract
Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called "sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and achieves competitive performance compared with the state-of-the-art (SOTA) methods in the low-data regime of the MuJoCo benchmark while utilizing three orders of magnitude fewer parameters, thereby requiring significantly fewer computational resources. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments.
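As a rough illustration of the sub-optimality loss mentioned above, the following minimal sketch evaluates it for a hypothetical unconstrained quadratic hypothesis F(s, u) = u^T Q u + 2 s^T M u with Q positive definite. This hypothesis, the function names, and the random data are assumptions made for illustration only; the hypothesis class used in the paper may differ, for example by including action constraints. The loss measures how much worse the expert action scores than the best action under the current parameters (Q, M).

```python
import numpy as np

def io_policy(Q, M, s):
    """Minimizer of F(s, u) = u^T Q u + 2 s^T M u over u (assumes Q symmetric positive definite)."""
    return -np.linalg.solve(Q, M.T @ s)

def suboptimality_loss(Q, M, s, u_hat):
    """F(s, u_hat) - min_u F(s, u): zero iff u_hat is optimal under (Q, M)."""
    def cost(u):
        return u @ Q @ u + 2.0 * s @ M @ u
    return cost(u_hat) - cost(io_policy(Q, M, s))

# Hypothetical data: 3-dimensional state, 2-dimensional action.
rng = np.random.default_rng(0)
n_s, n_u = 3, 2
A = rng.standard_normal((n_u, n_u))
Q = A @ A.T + np.eye(n_u)           # symmetric positive definite by construction
M = rng.standard_normal((n_s, n_u))
s = rng.standard_normal(n_s)
u_expert = rng.standard_normal(n_u)  # stand-in for an expert (e.g. MPC) action

loss_expert = suboptimality_loss(Q, M, s, u_expert)
loss_at_opt = suboptimality_loss(Q, M, s, io_policy(Q, M, s))
```

For this particular hypothesis the loss equals (u_hat + Q^{-1} M^T s)^T Q (u_hat + Q^{-1} M^T s), which is nonnegative and jointly convex in (Q, M), so in this simplified setting fitting the parameters to expert data is a convex problem.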
