Belief-Based Offline Reinforcement Learning for Delay-Robust Policy Optimization

Simon Sinong Zhan, Qingyuan Wu, Philip Wang, Frank Yang, Xiangyu Shi, Chao Huang, Qi Zhu

TL;DR

DT-CORL tackles the problem of deploying offline-trained RL agents in latency-prone environments by learning delay-robust policies from delay-free data. It introduces a transformer-based belief predictor that maps delayed observations to latent state representations and integrates this belief into a belief-based policy-iteration framework, enabling end-to-end offline learning without online interaction. The approach enforces policy constraints via Wasserstein-based regularization to mitigate out-of-distribution queries, and supports online adaptation by predicting latent states from delayed inputs. Empirical results on D4RL benchmarks show DT-CORL consistently outperforms history-augmentation and naive belief-based baselines across deterministic and stochastic delays, reducing the sim-to-real latency gap while preserving data efficiency. These findings highlight the practical impact of jointly optimizing belief estimation and policy learning for delay-robust offline RL in robotics and autonomous systems with uncertain latency.

Abstract

Offline-to-online deployment of reinforcement-learning (RL) agents must bridge two gaps: (1) the sim-to-real gap, where real systems add latency and other imperfections not present in simulation, and (2) the interaction gap, where policies trained purely offline face out-of-distribution states during online execution because gathering new interaction data is costly or risky. Agents therefore have to generalize from static, delay-free datasets to dynamic, delay-prone environments. Standard offline RL learns from delay-free logs yet must act under delays that break the Markov assumption and hurt performance. We introduce DT-CORL (Delay-Transformer belief policy Constrained Offline RL), an offline-RL framework built to cope with delayed dynamics at deployment. DT-CORL (i) produces delay-robust actions with a transformer-based belief predictor even though it never sees delayed observations during training, and (ii) is markedly more sample-efficient than naïve history-augmentation baselines. Experiments on D4RL benchmarks with several delay settings show that DT-CORL consistently outperforms both history-augmentation and vanilla belief-based methods, narrowing the sim-to-real latency gap while preserving data efficiency.
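The abstract says the belief predictor is trained from delay-free logs yet must infer the current state from delayed observations. One common way to set this up in delayed RL, shown here as a minimal sketch and not as the authors' confirmed pipeline, is to slice each delay-free trajectory into pairs of (state observed `delay` steps ago, actions taken since then) and the true current state. The function name `build_delayed_samples`, the tuple layout, and the toy integrator trajectory are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]
Sample = Tuple[State, List[Action], State]


def build_delayed_samples(
    states: Sequence[State],
    actions: Sequence[Action],
    delay: int,
) -> List[Sample]:
    """Hypothetical helper: turn one delay-free trajectory into belief-training samples.

    states  holds s_0 .. s_T   (length T + 1)
    actions holds a_0 .. a_{T-1} (length T)

    For every t >= delay the sample is
        input  = (s_{t-delay}, [a_{t-delay}, ..., a_{t-1}])
        target = s_t
    which mimics what an agent would see under a constant observation delay.
    """
    if delay < 0:
        raise ValueError("delay must be non-negative")
    if len(states) != len(actions) + 1:
        raise ValueError("expected len(states) == len(actions) + 1")

    samples: List[Sample] = []
    for t in range(delay, len(states)):
        delayed_state = states[t - delay]        # last state the agent has observed
        history = list(actions[t - delay:t])     # actions issued since that observation
        samples.append((delayed_state, history, states[t]))
    return samples


# Toy delay-free trajectory of a 1-D integrator, s_{t+1} = s_t + a_t (illustrative data only).
toy_states = [(0.0,), (1.0,), (3.0,), (6.0,), (10.0,)]
toy_actions = [(1.0,), (2.0,), (3.0,), (4.0,)]
toy_samples = build_delayed_samples(toy_states, toy_actions, delay=2)
```

A sequence model such as a transformer could then be fit to map each `(delayed_state, history)` input to its target state. With `delay=0` the history is empty and the target equals the input state, which recovers the delay-free case.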

Paper Structure

This paper contains 32 sections, 4 theorems, 22 equations, 8 figures, 15 tables.

Key Result

Lemma 4.1

For policies $\pi_{\Delta^\tau}$ and $\pi_\Delta$ with $\Delta^\tau < \Delta$, and for any $x \in \mathcal{X}$, if $Q_{\Delta^\tau}$ is $L_Q$-Lipschitz continuous ($L_Q$-LC), the performance difference between the policies can be bounded as follows:
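
As general background, and not the lemma's own statement, the following standard inequality shows why a Lipschitz assumption on $Q$ pairs naturally with a Wasserstein distance between policies. It assumes that $L_Q$-LC means Lipschitz continuity in the action argument with respect to a metric on the action space; that reading of Definition 3.3, and the symbols $\mu$, $\nu$, $f$, are assumptions introduced here. For any two action distributions $\mu$ and $\nu$ and any fixed $x$:

$$
\left| \mathbb{E}_{a \sim \mu}\left[Q(x, a)\right] - \mathbb{E}_{a \sim \nu}\left[Q(x, a)\right] \right| \le L_Q \, W_1(\mu, \nu),
\qquad
W_1(\mu, \nu) = \sup_{\|f\|_{\mathrm{Lip}} \le 1} \left( \mathbb{E}_{a \sim \mu}\left[f(a)\right] - \mathbb{E}_{a \sim \nu}\left[f(a)\right] \right).
$$

The inequality follows from Kantorovich-Rubinstein duality, since $Q(x, \cdot)/L_Q$ is $1$-Lipschitz and is therefore one of the test functions in the supremum.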

Figures (8)

  • Figure 1: Overall pipeline of DT-CORL. In the Offline Training phase, trajectory data are augmented to train the transformer belief, which is then used for belief-based policy iteration (PI) in the offline setting. In the Online Adaptation phase, the transformer belief predicts the current state from delayed observations, and the agent adapts with the offline-trained policy.
  • Figure 2: Step-by-step detailed comparison of prediction accuracy for different models.
  • Figure 3: Average performance of Aug-CQL, Belief-CQL, and DT-CORL across three dexterous hand manipulation tasks, panels (a)-(c), under delay settings ranging from $4$ to $16$.
  • Figure 4: Ground Truth
  • Figure 5: Ensemble MLP
  • ...and 3 more figures

Theorems & Definitions (10)

  • Definition 3.1: Lipschitz Continuous Policy [rl_lipschitz_continuous]
  • Definition 3.2: Lipschitz Continuous MDP [rl_lipschitz_continuous]
  • Definition 3.3: Lipschitz Continuous Q-function [rl_lipschitz_continuous]
  • Lemma 4.1: Delayed Performance Difference Bound [wu2024boosting]
  • Lemma 4.2: Delayed Q-value Difference Bound [wu2024boosting]
  • Proposition 4.3
  • Remark 4.4
  • Remark 4.5
  • Lemma B.1
  • proof