DeepAveragers: Offline Reinforcement Learning by Solving Derived Non-Parametric MDPs

Aayam Shrestha, Stefan Lee, Prasad Tadepalli, Alan Fern

TL;DR

The paper addresses offline reinforcement learning by deriving a non-parametric MDP (DAC-MDP) from a fixed dataset and solving a finite core MDP to enable optimal planning on top of deep representations. It proves a Lipschitz-based bound on the DAC-MDP policy's performance relative to the true MDP, involving $L(k,C)$, $\bar{d}_{max}$, $Q_{max}$, and $\epsilon(k,N,\delta)$. A GPU-accelerated value-iteration solver scales to millions of states, and experiments on Atari-scale image-based tasks demonstrate improved or competitive performance and zero-shot adaptation to altered objectives. This work demonstrates principled integration of planning and representation learning for offline RL, with clear directions for richer representations and scalability.

Abstract

We study an approach to offline reinforcement learning (RL) based on optimally solving finitely-represented MDPs derived from a static dataset of experience. This approach can be applied on top of any learned representation and has the potential to easily support multiple solution objectives as well as zero-shot adjustment to changing environments and goals. Our main contribution is to introduce the Deep Averagers with Costs MDP (DAC-MDP) and to investigate its solutions for offline RL. DAC-MDPs are a non-parametric model that can leverage deep representations and account for limited data by introducing costs for exploiting under-represented parts of the model. In theory, we show conditions that allow for lower-bounding the performance of DAC-MDP solutions. We also investigate the empirical behavior in a number of environments, including those with image-based observations. Overall, the experiments demonstrate that the framework can work in practice and scale to large complex offline RL problems.

Paper Structure

This paper contains 20 sections, 3 theorems, 24 equations, 16 figures, 2 algorithms.

Key Result

Theorem 3.1

For any data set $\mathcal{D}$ of size $N$, let $\tilde{Q}$ and $\tilde{\pi}$ be the optimal Q-function and policy for the corresponding DAC-MDP with parameters $k$ and $C$. If $B[\tilde{Q}]$ is Lipschitz continuous with constant $L(k,C)$, then with probability at least $1-\delta$, which for L2 distance over a $d$-dimensional space yields $\epsilon(k,N,\delta) = O\left(\sqrt{\frac{1}{k}\left(d\ln{

Figures (16)

  • Figure 1: Overview of Offline RL via DAC-MDPs. Given a static experience dataset, we first compile it into a finite tabular MDP which is at most the size of the dataset. This MDP contains the "core" states of the full continuous DAC-MDP. The finite core-state MDP is then solved via value iteration, resulting in a policy and Q-value function for the core states. This finite Q-function is used to define a non-parametric Q-function for the continuous DAC-MDP, which allows for Q-values and hence a policy to be computed for previously unseen states.
  • Figure 2: Greedy policy performance for CartPole with varying (a) cost parameter $C$, (b) smoothness parameter $k$, (c) policy smoothing parameter $k_\pi$.
  • Figure 3: Results on Atari 100K: (left) BCQ, (right) DQN. Each agent is trained for 100K iterations (training steps) and evaluated on 10 episodes every 10K steps. At each of these evaluation checkpoints, we use the internal representation to compile DAC-MDPs. We then evaluate the DAC-MDPs for $N_e=6$. Runs are averaged over 5 seeds, and error bars show the 95% confidence interval.
  • Figure 4: Results for different sets of candidate policies $N_e$ on 100K dataset. Here we plot the final performance of BCQ representation (left) and DQN representation (right) along with the DAC-MDP performances for different values of $N_e$. Runs averaged over 5 seeds. Error bars plot the 95% confidence interval.
  • Figure 5: Atari results for 2.5M dataset. We show the final performance of BCQ and DQN trained for 2.5M iterations. We also use the same representation for the DAC-MDPs named as DAC-BCQ and DAC-DQN respectively. All DAC-MDPs are evaluated with $N_e=6$.
  • ...and 11 more figures
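
The pipeline described in the Figure 1 caption (compile the dataset into a finite core-state MDP, solve it by value iteration, then read off Q-values for unseen states) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes that neighbours are the $k$ nearest stored transitions with the same action under L2 distance, that each neighbour's reward is reduced by $C$ times that distance, and that every action occurs at least $k$ times in the data. Terminal-state handling and the GPU solver are omitted, and all function names (`knn`, `build_core_mdp`, `value_iteration`, `q_unseen`) are hypothetical.

```python
import numpy as np

def knn(query, points, k):
    """Return indices and L2 distances of the k rows of `points` nearest to `query`."""
    dists = np.linalg.norm(points - query, axis=1)
    idx = np.argsort(dists)[:k]
    return idx, dists[idx]

def build_core_mdp(S, A, R, S_next, n_actions, k, C):
    """Compile transitions (S[i], A[i], R[i], S_next[i]) into a tabular MDP.

    Core state j is the observed next state S_next[j], so the table has
    exactly as many states as there are transitions in the dataset.
    """
    n = len(S_next)
    nbr = np.zeros((n, n_actions, k), dtype=int)  # ids of neighbouring transitions
    rew = np.zeros((n, n_actions, k))             # rewards after the distance cost
    for a in range(n_actions):
        ids = np.flatnonzero(A == a)              # transitions that took action a
        for j in range(n):
            loc, d = knn(S_next[j], S[ids], k)
            nbr[j, a] = ids[loc]
            rew[j, a] = R[ids[loc]] - C * d       # assumed penalty: C times distance
    return nbr, rew

def value_iteration(nbr, rew, gamma=0.99, iters=200):
    """Solve the core MDP; each neighbour is followed with probability 1/k."""
    V = np.zeros(nbr.shape[0])
    for _ in range(iters):
        Q = (rew + gamma * V[nbr]).mean(axis=2)   # average over the k neighbours
        V = Q.max(axis=1)
    return Q, V

def q_unseen(s, S, A, R, V, n_actions, k, C, gamma=0.99):
    """Non-parametric Q-values for a state `s` that is not in the dataset."""
    q = np.zeros(n_actions)
    for a in range(n_actions):
        ids = np.flatnonzero(A == a)
        loc, d = knn(s, S[ids], k)
        q[a] = np.mean(R[ids[loc]] - C * d + gamma * V[ids[loc]])
    return q
```

In this sketch the states passed in as `S` and `S_next` would be feature vectors from whatever learned representation is in use. A larger `C` makes actions whose nearest stored transitions are far from the query state look less attractive, which is how the cost term discourages exploiting under-represented regions.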

Theorems & Definitions (5)

  • Theorem 3.1
  • Lemma A.1
  • Lemma A.2
  • proof
  • proof