Model-Free Robust $φ$-Divergence Reinforcement Learning Using Both Offline and Online Data
Kishan Panaganti, Adam Wierman, Eric Mazumdar
TL;DR
The paper tackles robustness in reinforcement learning under model mismatch within the robust φ-regularized MDP (RRMDP) framework. It develops two model-free algorithms: RPQ, an offline fitted Q-iteration method, and HyTQ, a hybrid offline-online method for finite-horizon problems. RPQ uses a dual representation of the robust Bellman operator to learn an ε-optimal robust policy from offline data collected on the nominal model only, and its analysis covers a class of φ-divergences. HyTQ extends this to the hybrid setting using total-variation (TV) divergence duality and backward induction, and it relaxes the out-of-data-distribution assumption while retaining suboptimality and sample-complexity guarantees under bilinear and related realizability assumptions. Together, these results give a unified treatment of robust RL with general function approximation in high-dimensional settings, covering offline and hybrid offline-online learning.
Abstract
The robust $φ$-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings. This work makes two contributions. First, we propose a model-free algorithm called Robust $φ$-regularized fitted Q-iteration (RPQ) for learning an $ε$-optimal robust policy using only historical data collected by rolling out a behavior policy (satisfying a robust exploratory requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of $φ$-divergences that achieves robust optimal policies in high-dimensional systems with general function approximation. Second, we introduce the hybrid robust $φ$-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. For this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration (HyTQ, pronounced "height-Q"). To the best of our knowledge, this is the first work to improve the out-of-data-distribution assumption for large-scale problems with general function approximation under the hybrid robust $φ$-regularized reinforcement learning framework. Finally, we provide theoretical guarantees on the performance of the policies learned by our algorithms on systems with arbitrarily large state spaces.
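To illustrate why a dual representation lets a robust backup be computed from nominal-model quantities alone, the following minimal sketch uses the KL divergence, one member of the $φ$-divergence family. It relies on the standard identity $\inf_{P} \{ E_P[V] + λ\,\mathrm{KL}(P \| P^o) \} = -λ \log E_{P^o}[\exp(-V/λ)]$ over a finite set of next states. This is not the paper's RPQ or HyTQ update: the function names, the regularization weight `lam`, and the toy numbers are all assumptions made for illustration.

```python
import math


def kl_regularized_dual(values, nominal_probs, lam):
    """Closed-form value of inf_P { E_P[V] + lam * KL(P || P0) }.

    Only the nominal distribution P0 is needed (illustrative helper,
    not taken from the paper).
    """
    m = min(values)  # shift so every exponent is <= 0 (numerical stability)
    s = sum(q * math.exp(-(v - m) / lam) for v, q in zip(values, nominal_probs))
    return m - lam * math.log(s)


def worst_case_model(values, nominal_probs, lam):
    """Minimizing distribution: nominal mass tilted toward low-value states."""
    m = min(values)
    weights = [q * math.exp(-(v - m) / lam) for v, q in zip(values, nominal_probs)]
    z = sum(weights)
    return [w / z for w in weights]


def primal_objective(values, probs, nominal_probs, lam):
    """E_P[V] + lam * KL(P || P0) for an explicit candidate model P."""
    expected = sum(p * v for p, v in zip(probs, values))
    kl = sum(p * math.log(p / q) for p, q in zip(probs, nominal_probs) if p > 0)
    return expected + lam * kl


# Toy next-state values and nominal transition probabilities (assumed numbers).
values = [1.0, 0.0, 2.0]
nominal = [0.5, 0.3, 0.2]
lam = 0.5

dual = kl_regularized_dual(values, nominal, lam)
p_star = worst_case_model(values, nominal, lam)
primal = primal_objective(values, p_star, nominal, lam)
print(dual, primal)  # the two values agree up to floating-point error
```

The dual side is an expectation under the nominal model only, so in principle it can be estimated from samples drawn on the simulator without constructing the adversarial model explicitly; other $φ$-divergences lead to different dual forms.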
