Belief-Based Offline Reinforcement Learning for Delay-Robust Policy Optimization
Simon Sinong Zhan, Qingyuan Wu, Philip Wang, Frank Yang, Xiangyu Shi, Chao Huang, Qi Zhu
TL;DR
DT-CORL (Delay-Transformer belief policy Constrained Offline RL) addresses the deployment of offline-trained RL agents in latency-prone environments by learning delay-robust policies from delay-free data. It introduces a transformer-based belief predictor that maps delayed observations to latent state representations and integrates this belief into a belief-based policy-iteration framework, enabling end-to-end offline learning without online interaction. Wasserstein-based regularization constrains the policy to mitigate out-of-distribution queries, and the method supports online adaptation by predicting latent states from delayed inputs. On D4RL benchmarks, DT-CORL consistently outperforms history-augmentation and naive belief-based baselines under both deterministic and stochastic delays, narrowing the sim-to-real latency gap while preserving data efficiency. These results suggest that jointly optimizing belief estimation and policy learning is a practical route to delay-robust offline RL in robotics and autonomous systems with uncertain latency.
Abstract
Offline-to-online deployment of reinforcement-learning (RL) agents must bridge two gaps: (1) the sim-to-real gap, where real systems add latency and other imperfections not present in simulation, and (2) the interaction gap, where policies trained purely offline face out-of-distribution states during online execution because gathering new interaction data is costly or risky. Agents therefore have to generalize from static, delay-free datasets to dynamic, delay-prone environments. Standard offline RL learns from delay-free logs yet must act under delays that break the Markov assumption and hurt performance. We introduce DT-CORL (Delay-Transformer belief policy Constrained Offline RL), an offline-RL framework built to cope with delayed dynamics at deployment. DT-CORL (i) produces delay-robust actions with a transformer-based belief predictor even though it never sees delayed observations during training, and (ii) is markedly more sample-efficient than naïve history-augmentation baselines. Experiments on D4RL benchmarks with several delay settings show that DT-CORL consistently outperforms both history-augmentation and vanilla belief-based methods, narrowing the sim-to-real latency gap while preserving data efficiency.
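To make the setting concrete, the sketch below shows one common way to derive belief-prediction training pairs from a delay-free trajectory: at step t the agent is assumed to see only the state from `delay` steps earlier plus the actions it has issued since, and the target is the true current state. This is an illustration of the general delayed-RL construction under a fixed, deterministic delay, not the authors' implementation; the function name `make_delayed_pairs` and the array layout are hypothetical.

```python
import numpy as np


def make_delayed_pairs(states, actions, delay):
    """Build belief-prediction pairs from one delay-free trajectory.

    Assumed layout (hypothetical): states[t] is the state at step t and
    actions[t] is the action taken in that state, so both arrays have the
    same length. For each step t >= delay, the input is the stale state
    states[t - delay] plus the `delay` actions issued since then, and the
    target is the true current state states[t].
    """
    states = np.asarray(states, dtype=float)
    actions = np.asarray(actions, dtype=float)
    if len(states) != len(actions):
        raise ValueError("states and actions must have the same length")
    horizon = len(states)
    if not 1 <= delay < horizon:
        raise ValueError("delay must satisfy 1 <= delay < trajectory length")

    delayed_obs, pending_actions, targets = [], [], []
    for t in range(delay, horizon):
        delayed_obs.append(states[t - delay])        # what the agent sees
        pending_actions.append(actions[t - delay:t])  # actions not yet reflected
        targets.append(states[t])                     # what it should infer
    return np.stack(delayed_obs), np.stack(pending_actions), np.stack(targets)


# Toy trajectory: 6 steps, 1-D states and actions, fixed delay of 2 steps.
toy_states = np.arange(6, dtype=float).reshape(6, 1)
toy_actions = 10.0 * np.arange(6, dtype=float).reshape(6, 1)
obs, acts, tgt = make_delayed_pairs(toy_states, toy_actions, delay=2)
print(obs.shape, acts.shape, tgt.shape)
```

A belief predictor trained on such pairs never needs delayed logs, since the delay is simulated by slicing the delay-free data. Stochastic delays would require sampling `delay` per step, which this sketch does not cover.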
