
Residuals-based Offline Reinforcement Learning

Qing Zhu, Xian Yu

Abstract

Offline reinforcement learning (RL) has received increasing attention for learning policies from previously collected data without interaction with the real environment, which is particularly important in high-stakes applications. While a growing body of work has developed offline RL algorithms, these methods often rely on restrictive assumptions about data coverage and suffer from distribution shift. In this paper, we propose a residuals-based offline RL framework for general state and action spaces. Specifically, we define a residuals-based Bellman optimality operator that explicitly incorporates estimation error in learning transition dynamics into policy optimization by leveraging empirical residuals. We show that this Bellman operator is a contraction mapping and identify conditions under which its fixed point is asymptotically optimal and possesses finite-sample guarantees. We further develop a residuals-based offline deep Q-learning (DQN) algorithm. Using a stochastic CartPole environment, we demonstrate the effectiveness of our residuals-based offline DQN algorithm.
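As orientation for the key result below, a rough and assumed illustration of what "leveraging empirical residuals" can mean is to feed the residuals of a fitted dynamics model $\hat{f}$ back into the Bellman backup; the paper's own definition of the operator may differ in its details:

$$\hat{T}_N Q(s,a) \;=\; r(s,a) \;+\; \frac{\gamma}{N}\sum_{i=1}^{N} \max_{a'\in\mathcal{A}} Q\big(\hat{f}(s,a) + \hat{\varepsilon}_i,\; a'\big), \qquad \hat{\varepsilon}_i = s'_i - \hat{f}(s_i,a_i),$$

where $(s_i, a_i, s'_i)$, $i=1,\dots,N$, are the transitions in the offline dataset and $r$ is the reward function.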

Paper Structure

This paper contains 13 sections, 9 theorems, 30 equations, 3 figures, and 1 algorithm.

Key Result

Theorem 1

For $0\le \gamma<1$, the residuals-based Bellman optimality operator $\hat{T}_N$ defined in eq:empirical bellman operator and the full-information Bellman operator $T^\star_N$ defined in eq:full info bellman operator are $\gamma$-contraction mappings, i.e., $\forall\, Q^{1}, Q^{2} \in \mathcal{B}(\mathcal{S}\times\mathcal{A})$,
$$\big\|\hat{T}_N Q^{1}-\hat{T}_N Q^{2}\big\|_{\infty}\le \gamma\,\big\|Q^{1}-Q^{2}\big\|_{\infty} \quad\text{and}\quad \big\|T^\star_N Q^{1}-T^\star_N Q^{2}\big\|_{\infty}\le \gamma\,\big\|Q^{1}-Q^{2}\big\|_{\infty}.$$
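Because $\hat{T}_N$ is a $\gamma$-contraction on the complete space $\mathcal{B}(\mathcal{S}\times\mathcal{A})$ under the sup norm, the Banach fixed-point theorem yields a standard consequence (stated here for orientation, not quoted from the paper): the operator has a unique fixed point $\hat{Q}_N$, and iterating it converges geometrically,

$$\hat{T}_N \hat{Q}_N = \hat{Q}_N, \qquad \big\|\hat{T}_N^{\,k} Q - \hat{Q}_N\big\|_{\infty} \;\le\; \gamma^{k}\,\big\|Q - \hat{Q}_N\big\|_{\infty} \quad \text{for all } Q \in \mathcal{B}(\mathcal{S}\times\mathcal{A}) \text{ and } k \ge 1.$$

The same statement holds for $T^\star_N$ with its fixed point $Q_N^\star$.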

Figures (3)

  • Figure C1: Flowchart of the residuals-based offline RL.
  • Figure D1: Comparison of different sample sizes
  • Figure D2: Comparison of Models

Theorems & Definitions (17)

  • Theorem 1
  • Proof
  • Proposition 1: Lipschitz continuity of $Q_N^\star$
  • Proof
  • Proposition 2: Lipschitz continuity of $Q^\star$
  • Proof
  • Theorem 2: Consistency of fixed point $\hat{Q}_N$
  • Proof
  • Corollary 1: Consistency of the value function $\hat{V}_N$
  • Proof
  • ...and 7 more
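To make the experimental setting more concrete, the sketch below shows plain offline (batch) deep Q-learning on a fixed transition dataset. It is not the paper's residuals-based algorithm; the dataset layout, network sizes, and hyperparameters are illustrative assumptions. The residuals-based variant described in the abstract would additionally fit a transition model and construct the Bellman targets using the empirical residuals.

import numpy as np
import torch
import torch.nn as nn

STATE_DIM, NUM_ACTIONS, GAMMA = 4, 2, 0.99   # CartPole-like dimensions (assumed)

class QNet(nn.Module):
    """Small fully connected Q-network: state -> Q-value per discrete action."""
    def __init__(self, state_dim, num_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
            nn.Linear(64, num_actions),
        )
    def forward(self, s):
        return self.net(s)

def offline_dqn(dataset, num_steps=5000, batch_size=128, lr=1e-3, target_sync=250):
    """dataset: dict of NumPy arrays with keys 's', 'a', 'r', 's_next', 'done'.
    The dataset is fixed; there is no interaction with the environment."""
    q, q_target = QNet(STATE_DIM, NUM_ACTIONS), QNet(STATE_DIM, NUM_ACTIONS)
    q_target.load_state_dict(q.state_dict())
    opt = torch.optim.Adam(q.parameters(), lr=lr)
    n = len(dataset["r"])
    for step in range(num_steps):
        # Sample a minibatch of logged transitions from the fixed dataset.
        idx = np.random.randint(0, n, size=batch_size)
        s = torch.as_tensor(dataset["s"][idx], dtype=torch.float32)
        a = torch.as_tensor(dataset["a"][idx], dtype=torch.int64)
        r = torch.as_tensor(dataset["r"][idx], dtype=torch.float32)
        s_next = torch.as_tensor(dataset["s_next"][idx], dtype=torch.float32)
        done = torch.as_tensor(dataset["done"][idx], dtype=torch.float32)
        # Empirical Bellman target: r + gamma * max_a' Q_target(s', a') for non-terminal s'.
        with torch.no_grad():
            target = r + GAMMA * (1.0 - done) * q_target(s_next).max(dim=1).values
        pred = q(s).gather(1, a.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Periodically refresh the target network.
        if step % target_sync == 0:
            q_target.load_state_dict(q.state_dict())
    return q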