Model-Free Robust $φ$-Divergence Reinforcement Learning Using Both Offline and Online Data

Kishan Panaganti, Adam Wierman, Eric Mazumdar

TL;DR

The paper tackles robustness in reinforcement learning under model mismatch by developing two model-free algorithms: Robust φ-regularized fitted Q-iteration (RPQ), an offline method for the robust φ-regularized MDP (RRMDP) framework, and Hybrid robust Total-variation-regularized Q-iteration (HyTQ), a hybrid offline-online method for finite-horizon problems. RPQ leverages a dual representation of the robust Bellman operator to learn an ε-optimal robust policy using only nominal-model offline data across general φ-divergences. HyTQ extends this to a hybrid setting using TV-divergence duality and backward induction, with an improved out-of-data-distribution assumption and theoretical suboptimality and sample-complexity guarantees under bilinear and related realizability assumptions. Together, these results give a unified treatment of robust RL in high-dimensional settings with broad divergence families, with guidance for offline, online, or hybrid deployment in uncertain environments.

Abstract

The robust $φ$-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings. This work makes two important contributions. First, we propose a model-free algorithm called Robust $φ$-regularized fitted Q-iteration (RPQ) for learning an $ε$-optimal robust policy that uses only the historical data collected by rolling out a behavior policy (subject to a robust exploratory requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of $φ$-divergences achieving robust optimal policies in high-dimensional systems with general function approximation. Second, we introduce the hybrid robust $φ$-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. Within this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q). To the best of our knowledge, we provide the first improved out-of-data-distribution assumption in large-scale problems with general function approximation under the hybrid robust $φ$-regularized reinforcement learning framework. Finally, we provide theoretical guarantees on the performance of the learned policies of our algorithms on systems with arbitrarily large state spaces.

Paper Structure (20 sections, 28 theorems, 116 equations, 1 table, 2 algorithms)


Key Result

Proposition 1

Consider a robust $\varphi$-regularized MDP. For any $Q: \mathcal{S}\times\mathcal{A}\to [0, 1/(1-\gamma)]$, the robust regularized Bellman operator $\mathcal{T}$ can be equivalently written in a dual form, where $V(s)=\max_{a\in\mathcal{A}} Q(s,a)$ and $\Theta\subset\mathbb{R}$ is a bounded interval of the real line that depends on $\varphi^*$.
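
As an illustration of this kind of dual form, the sketch below assumes the standard convex-duality identity for $\varphi$-divergence regularization, $\inf_{P}\{\mathbb{E}_{P}[V] + \lambda D_\varphi(P\,\|\,P^o)\} = -\inf_{\eta}\{\lambda\,\mathbb{E}_{P^o}[\varphi^*((\eta - V)/\lambda)] - \eta\}$, with a nominal next-state distribution $P^o$ and a regularization weight $\lambda>0$. Neither the exact statement of the proposition nor these symbols are taken from the text above. The sketch specializes to the KL divergence, where $\varphi^*(s)=e^s-1$ and the inner problem has the closed form $-\lambda\log\mathbb{E}_{P^o}[e^{-V/\lambda}]$, and compares a grid search over $\eta$ with that closed form. All function names and numbers are hypothetical.

```python
import numpy as np

def phi_star_kl(s):
    """Convex conjugate of phi(t) = t*log(t) - t + 1 (the KL generator)."""
    return np.exp(s) - 1.0

def dual_backup(reward, gamma, lam, p_nominal, v_next, etas):
    """Assumed dual form: r - gamma * min_eta (lam * E[phi*((eta - V)/lam)] - eta).

    The minimum over eta is approximated by a search over the grid `etas`.
    """
    # Shape (len(etas), n_states): phi* evaluated for every eta and next state.
    conj = phi_star_kl((etas[:, None] - v_next[None, :]) / lam)
    inner = lam * (conj @ p_nominal) - etas
    return reward - gamma * inner.min()

def kl_closed_form_backup(reward, gamma, lam, p_nominal, v_next):
    """Closed form for KL: r - gamma * lam * log E[exp(-V / lam)]."""
    return reward - gamma * lam * np.log(p_nominal @ np.exp(-v_next / lam))

# Toy three-state example (all numbers are made up for illustration).
p_nominal = np.array([0.5, 0.3, 0.2])   # nominal next-state probabilities
v_next = np.array([1.0, 2.0, 0.5])      # V(s') = max_a Q(s', a)
etas = np.linspace(-5.0, 5.0, 20001)    # search grid standing in for Theta

grid_value = dual_backup(0.1, 0.9, 1.0, p_nominal, v_next, etas)
exact_value = kl_closed_form_backup(0.1, 0.9, 1.0, p_nominal, v_next)
nominal_value = 0.1 + 0.9 * (p_nominal @ v_next)  # non-robust backup
```

The dual form only needs expectations under the nominal distribution, which is why it suits learning from nominal-model data. In this KL instance the robust backup never exceeds the nominal one, and it approaches the nominal backup as `lam` grows.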

Theorems & Definitions (50)

  • Proposition 1
  • Proposition 2
  • Theorem 1
  • Remark 1
  • Corollary 1
  • Corollary 2
  • Theorem 2
  • Remark 2
  • Remark 3
  • Lemma 1 (Levy et al., 2020)
  • ...and 40 more