Offline Reinforcement Learning via Inverse Optimization

Ioannis Dimanidis, Tolga Ok, Peyman Mohajerin Esfahani

TL;DR

The paper tackles offline RL under distribution shift in continuous control by coupling a non-causal MPC action-improvement step with inverse-optimization (IO) policy distillation using a convex sub-optimality loss. It introduces a robust, disturbance-aware MPC (RMPC) with an exact convex reformulation to handle model-mismatch disturbances, and demonstrates that the IO hypothesis class can achieve competitive performance in low-data regimes with significantly fewer parameters than neural baselines. Empirical results on quadrotor control and MuJoCo benchmarks show strong data efficiency and robustness, and the code is publicly released to facilitate reproducibility. The work thus bridges robust control and offline RL, offering a practical, convex framework for distribution-shifted continuous-control tasks and opening avenues for real-time in-hindsight IO and RMPC extensions to offline settings.

Abstract

Inspired by the recent successes of Inverse Optimization (IO) across various application domains, we propose a novel offline Reinforcement Learning (ORL) algorithm for continuous state and action spaces, leveraging the convex loss function called "sub-optimality loss" from the IO literature. To mitigate the distribution shift commonly observed in ORL problems, we further employ a robust and non-causal Model Predictive Control (MPC) expert steering a nominal model of the dynamics using in-hindsight information stemming from the model mismatch. Unlike the existing literature, our robust MPC expert enjoys an exact and tractable convex reformulation. In the second part of this study, we show that the IO hypothesis class, trained by the proposed convex loss function, enjoys ample expressiveness and achieves competitive performance compared with the state-of-the-art (SOTA) methods in the low-data regime of the MuJoCo benchmark while utilizing three orders of magnitude fewer parameters, thereby requiring significantly fewer computational resources. To facilitate the reproducibility of our results, we provide an open-source package implementing the proposed algorithms and the experiments.
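To make the "sub-optimality loss" mentioned in the abstract concrete, here is a minimal sketch. It assumes a quadratic IO hypothesis $F_\theta(s,u) = u^\top \theta_{uu} u + 2 s^\top \theta_{su} u$ with $\theta_{uu} \succ 0$ and no input constraints. This form is common in the IO literature but is not stated in the text above, and the function names are hypothetical. The loss is the gap $F_\theta(s,a) - \min_u F_\theta(s,u)$ between the objective at the expert action $a$ and its optimal value. Since $F_\theta$ is linear in $\theta$ and a pointwise minimum of linear functions is concave, this gap is convex in $\theta$.

```python
import numpy as np

# Assumed hypothesis (not taken from the source): F(s, u) = u' θ_uu u + 2 s' θ_su u,
# with θ_uu positive definite and u unconstrained.

def objective(theta_uu, theta_su, s, u):
    """Value of the assumed quadratic objective F_theta(s, u)."""
    return float(u @ theta_uu @ u + 2.0 * s @ theta_su @ u)

def io_policy(theta_uu, theta_su, s):
    """Minimizer of F_theta(s, .): solves 2 θ_uu u + 2 θ_su' s = 0."""
    return -np.linalg.solve(theta_uu, theta_su.T @ s)

def suboptimality_loss(theta_uu, theta_su, s, a):
    """F_theta(s, a) - min_u F_theta(s, u); non-negative by construction."""
    u_star = io_policy(theta_uu, theta_su, s)
    return objective(theta_uu, theta_su, s, a) - objective(theta_uu, theta_su, s, u_star)
```

Under these assumptions the loss reduces to $(a - u^\star)^\top \theta_{uu} (a - u^\star)$, so it vanishes exactly when the expert action coincides with the action the hypothesis would choose.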

Paper Structure

This paper contains 28 sections, 3 theorems, 31 equations, 2 figures, 2 tables, 1 algorithm.

Key Result

Lemma 3.1

Under linear nominal dynamics $f_0(x,u) = Ax +Bu$, and quadratic costs $c(x,u) = \norm{x}^2_{Q_x} + \norm{u}^2_{Q_u}$ and $c_f(x,u) = \norm{x}^2_{Q_f}$, where $Q_x, Q_f\succcurlyeq 0$ and $Q_u \succ 0$, the objective of (eq:nc_mpc_formulation) can be equivalently expressed by with $\mathbf{Q_{x}} = \mathop{\mathrm{blkdiag}}\limits (I_{N-1}\otimes Q_{x}, Q_{f})$, $\mathbf{Q}_{\mathbf{u}}=I_{N}\oti
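The stacking idea behind Lemma 3.1 can be checked numerically. The sketch below builds the block-diagonal cost matrices with Kronecker products and confirms that the stacked quadratic form equals the sum of stage and terminal costs. Two points are assumptions rather than statements from the source: the stacked state is taken to be $(x_1, \dots, x_N)$ with inputs $(u_0, \dots, u_{N-1})$, and the input matrix is taken to be $I_N \otimes Q_u$, since the lemma text is cut off at that point. The function names are hypothetical.

```python
import numpy as np

def stacked_cost_matrices(Q_x, Q_u, Q_f, N):
    """Build blkdiag(I_{N-1} ⊗ Q_x, Q_f) and (assumed) I_N ⊗ Q_u."""
    n = Q_x.shape[0]
    bold_Q_x = np.zeros((N * n, N * n))
    # First N-1 diagonal blocks carry the stage cost matrix Q_x.
    bold_Q_x[:(N - 1) * n, :(N - 1) * n] = np.kron(np.eye(N - 1), Q_x)
    # Last diagonal block carries the terminal cost matrix Q_f.
    bold_Q_x[(N - 1) * n:, (N - 1) * n:] = Q_f
    bold_Q_u = np.kron(np.eye(N), Q_u)
    return bold_Q_x, bold_Q_u

def stagewise_cost(xs, us, Q_x, Q_u, Q_f):
    """Sum of stage costs; xs has rows x_1..x_N, us has rows u_0..u_{N-1}."""
    cost = sum(x @ Q_x @ x for x in xs[:-1])
    cost += xs[-1] @ Q_f @ xs[-1]
    cost += sum(u @ Q_u @ u for u in us)
    return float(cost)

def stacked_cost(xs, us, bold_Q_x, bold_Q_u):
    """Same cost written as two quadratic forms in the stacked vectors."""
    x = xs.reshape(-1)
    u = us.reshape(-1)
    return float(x @ bold_Q_x @ x + u @ bold_Q_u @ u)
```

The contribution of the fixed initial state $x_0$ is a constant under this indexing and is left out of both functions, so it does not affect the comparison.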

Figures (2)

  • Figure 1: Comparisons of several agents in the quadrotor environment. Left: The cost histogram of the offline IO and CQL agents and online model-based MPC and model-free PPO-3M (trained with 3M environment steps) agents. Center: The cost distributions of CQL agents trained with 4 seeds on various dataset lengths compared to a single IO-RMPC policy trained with 3000 samples. Right: Comparison of the cost distributions between oblivious and full disturbance MPC policies against the model-free PPO agent.
  • Figure 3: Steady-state cost distributions (log-log scale) over 100 trials of the experiments described in Section \ref{sec:fighter_dst}. Dashed lines represent the median values. Left: MPC policies vs IO-MPC. Center: Difference in performance between the robust and non-robust version of IO policies when faced with distribution shift. Right: Performance of IO-RMPC vs MPC policies when faced with distribution shift.

Theorems & Definitions (8)

  • Remark 2.1: MPC computational costs
  • Remark 2.2: Validity of in-hindsight trajectories
  • Remark 2.3: Literature on disturbance feedback and non-causal control
  • Lemma 3.1: Vectorized MPC formulation for linear dynamics
  • Lemma 3.2: Exact polytopic representation of robust constraint set
  • Theorem 3.3: Exact SDP reformulation
  • Remark 3.4: Uncertainty set
  • Remark 3.5: Exploration vs exploitation