Tackling Data Corruption in Offline Reinforcement Learning via Sequence Modeling

Jiawei Xu; Rui Yang; Shuang Qiu; Feng Luo; Meng Fang; Baoxiang Wang; Lei Han

TL;DR

Offline RL often suffers from data corruption, especially with limited data. The paper shows that vanilla Decision Transformer (DT) can be surprisingly robust to corruption, and introduces Robust Decision Transformer (RDT) to further enhance resilience using embedding dropout, Gaussian weighted learning, and iterative data correction. Across MuJoCo, Kitchen, and Adroit, RDT outperforms TD-based methods and DT under random, adversarial, and mixed corruption, and remains robust to test-time observation perturbations. This work demonstrates the viability of sequence modeling for learning from noisy offline data and provides an accessible implementation for reproducibility.

Abstract

Learning policies from offline datasets through offline reinforcement learning (RL) holds promise for scaling data-driven decision-making while avoiding unsafe and costly online interactions. However, real-world data collected from sensors or humans often contains noise and errors, posing a significant challenge for existing offline RL methods, particularly when the real-world data is limited. Our study reveals that prior research focusing on adapting predominant offline RL methods based on temporal difference learning still falls short under data corruption when the dataset is limited. In contrast, we discover that vanilla sequence modeling methods, such as Decision Transformer, exhibit robustness against data corruption, even without specialized modifications. To unlock the full potential of sequence modeling, we propose Robust Decision Transformer (RDT) by incorporating three simple yet effective robustness techniques: embedding dropout to improve the model's robustness against erroneous inputs, Gaussian weighted learning to mitigate the effects of corrupted labels, and iterative data correction to eliminate corrupted data from the source. Extensive experiments on MuJoCo, Kitchen, and Adroit tasks demonstrate RDT's superior performance under various data corruption scenarios compared to prior methods. Furthermore, RDT exhibits remarkable robustness in a more challenging setting that combines training-time data corruption with test-time observation perturbations. These results highlight the potential of sequence modeling for learning from noisy or corrupted offline datasets, thereby promoting the reliable application of offline RL in real-world scenarios. Our code is available at https://github.com/jiawei415/RobustDecisionTransformer.
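The three techniques named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract gives no formulas, so the exponential weight form, the z-score rule for flagging labels, and the names `beta`, `zeta`, `gaussian_weights`, `embedding_dropout`, and `correct_labels` are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_weights(pred, target, beta=1.0):
    """Assumed weighting: w = exp(-beta * squared_error) per sample.
    Labels far from the prediction get weights close to zero."""
    sq_err = np.sum((pred - target) ** 2, axis=-1)
    return np.exp(-beta * sq_err), sq_err

def weighted_loss(pred, target, beta=1.0):
    """Mean of weight * squared error, so suspect labels contribute little."""
    w, sq_err = gaussian_weights(pred, target, beta)
    return float(np.mean(w * sq_err))

def embedding_dropout(emb, p, rng):
    """Zero each embedding dimension independently with probability p,
    rescaling the survivors by 1 / (1 - p) (standard inverted dropout)."""
    if p <= 0.0:
        return emb
    mask = rng.random(emb.shape) >= p
    return emb * mask / (1.0 - p)

def correct_labels(pred, target, zeta=3.0):
    """Assumed correction rule: replace a label with the model prediction
    when the z-score of its error norm exceeds the threshold zeta."""
    err = np.linalg.norm(pred - target, axis=-1)
    z = (err - err.mean()) / (err.std() + 1e-8)
    flagged = z > zeta
    corrected = np.where(flagged[:, None], pred, target)
    return corrected, flagged

# Toy usage: 20 action labels, one of them heavily corrupted.
pred = np.zeros((20, 2))
target = pred + 0.01
target[7] += 5.0
weights, _ = gaussian_weights(pred, target)
corrected, flagged = correct_labels(pred, target)
```

In this toy batch the corrupted label at index 7 receives a near-zero weight and is the only one flagged for replacement, while the clean labels keep weights close to 1. In a real training loop the weights would be computed from detached prediction errors so that no gradient flows through them.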

Paper Structure (48 sections, 6 equations, 20 figures, 10 tables, 1 algorithm)

Introduction
Preliminaries
RL and Offline RL.
Decision Transformer (DT).
Data Corruption in Offline RL.
Sequence Modeling for Offline RL with Data Corruption
Motivating Example
Robust Decision Transformer
Embedding Dropout
Gaussian Weighted Learning
Iterative Data Correction
Experiments
Experimental Setups
Evaluation under Various Data Corruption
Results under Random Corruption.
...and 33 more sections

Figures (20)

Figure 1: Average normalized scores of offline RL algorithms under random data corruption across three MuJoCo tasks (halfcheetah, walker2d, and hopper) using "medium-replay-v2" datasets. Many offline RL algorithms experience substantial performance declines when subjected to data corruption. In contrast, DT demonstrates remarkable robustness, particularly in the $10\%$ data regime.
Figure 2: Framework of Robust Decision Transformer (RDT). RDT enhances the robustness of DT against data corruption by incorporating three components on top of DT: embedding dropout, Gaussian weighted learning, and iterative data correction.
Figure 3: (a) Comparing dropout methods under state attack: Embedding dropout outperforms directly dropping the entire state (DeFog) or dropping dimensions on the raw state. (b) Gaussian weighted learning under action attack: Gaussian weighted learning (DT w. GWL) alleviates overfitting to the corrupted data and slightly reduces the loss on clean data. (c) Iterative data correction (DT w. IDC) under action attack: The MSE between corrected and oracle data gradually decreases to near zero.
Figure 4: Results under (a) mixed random corruption and (b) mixed adversarial corruption.
Figure 5: Performance under various observation perturbation scales during the testing phase. All the algorithms are trained under mixed random corruption during the training phase.
...and 15 more figures
