Augmenting Offline Reinforcement Learning with State-only Interactions

Shangzhe Li; Xinhua Zhang

TL;DR

This paper uses online, state-only interactions to generate high-return trajectories with conditional diffusion models. These trajectories are blended with the original offline trajectories through a stitching algorithm, and the resulting augmented data can be applied generically to downstream reinforcement learners.

Abstract

Batch offline data have been shown to be considerably beneficial for reinforcement learning. Their benefit is further amplified by upsampling with generative models. In this paper, we consider a novel opportunity where interaction with the environment is feasible but restricted to observations, i.e., no reward feedback is available. This setting is broadly applicable, as simulators or even real cyber-physical systems are often accessible, whereas reward is often difficult or expensive to obtain. As a result, the learner must make good sense of the offline data to synthesize an efficient scheme for querying state transitions. Our method first leverages online interactions to generate high-return trajectories via conditional diffusion models. These are then blended with the original offline trajectories through a stitching algorithm, and the resulting augmented data can be applied generically to downstream reinforcement learners. Superior empirical performance is demonstrated over state-of-the-art data augmentation methods that are extended to utilize state-only interactions.

Paper Structure (35 sections, 18 equations, 13 figures, 5 tables, 6 algorithms)

Introduction
Related Work
Preliminary
Diffusion Probabilistic Models (DDPM)
Guided Diffusion
Trajectory Generator of DITS with State-only Interaction
Conditional Diffusion Models for Decision Making
Trajectory generation in DITS via state-only interaction
Conditioning with Classifier-free Guidance and the Training Objective
Inverse Dynamics Models (IDMs)
Stitcher of DITS
Forward Dynamics Criterion
Experiments
Environments
Baseline data augmentation methods
...and 20 more sections

Figures (13)

Figure 1: An example that illustrates the difference between four data augmentation methods. (a): the original offline dataset, where a river runs along the main diagonal with two bridges. There are six random trajectories. Two of them reach the Goal state (top right), which gives a high reward. Four trajectories are stuck in the lower-left half with a low reward. (b): MBTS concatenates trajectories, but it is restricted to the original trajectories and cannot leverage a bridge-crossing one that could be synthesized by a generative model. (c): the SynthER method, where all transitions contribute equally to the training of the diffusion model. (d): the conditional diffusion model, which super-samples high-return trajectory regions. As a result, new trajectories that cross the bridges are formed. (e): our DITS method of conditional diffusion followed by stitching, which concatenates low-return trajectories with high-return ones. Consequently, two dashed arrows are formed, allowing two low-return trajectories to be connected to a bridge-crossing one.
Figure 2: Diffusion-based Trajectory Stitching (DITS). Trajectories from DITS' trajectory generator are combined with the original dataset for the stitching process, which creates new transitions (blue arrows in the "Trajectory stitching" block) and discards old transitions (grey dotted edges). Stitching is facilitated by DITS' reward and action generators. A filter is applied to prune away low-return trajectories, and the result is finally used by a downstream RL method.
Figure 3: Trajectory generation of DITS. Given the latest $C$ states $s_{t-C+1},\ldots, s_t$, the diffuser uses classifier-free guidance with low-temperature sampling to generate a sequence of future states and rewards. The inverse dynamics model (IDM) is applied to generate action $a_t$ from $s_t$ and $s_{t+1}$. $\emptyset$ means the value is left for the diffuser to fill in, instead of being clamped to the observation (if available).
Figure 4: Ablation study on trajectory generation and stitching. We show the contribution of the two components of DITS by comparing its normalized average return on Hopper with two variants: no stitcher (dropping the stitching step) and no generator (dotted line, using the original dataset).
Figure 5: Ablation on the number of DITS-generated trajectories.
...and 8 more figures
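
The caption of Figure 3 mentions classifier-free guidance. As a rough illustration, the sketch below shows the standard way conditional and unconditional noise predictions are combined under that technique. It is not taken from the paper: the function name `cfg_noise`, the parameter `guidance_weight`, and the array shapes (a horizon of 4 steps, 3 values per step) are all hypothetical, and the paper's own weighting convention may differ.

```python
import numpy as np


def cfg_noise(eps_cond, eps_uncond, guidance_weight):
    """Combine two noise predictions with classifier-free guidance.

    Assumed convention (hypothetical, not confirmed by the source):
        eps = eps_uncond + w * (eps_cond - eps_uncond)
    so w = 0 ignores the condition, w = 1 recovers the conditional
    prediction, and w > 1 extrapolates past it toward the condition.
    """
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)


# Toy stand-ins for the outputs of a noise-prediction network.
eps_uncond = np.zeros((4, 3))  # prediction with the condition dropped
eps_cond = np.ones((4, 3))     # prediction given a high-return condition

guided = cfg_noise(eps_cond, eps_uncond, guidance_weight=1.5)
```

In a real sampler this combination would be evaluated at every denoising step, with both predictions coming from the same network called with and without the conditioning signal.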
