State-Action Inpainting Diffuser for Continuous Control with Delay

Dongqi Han, Wei Wang, Enze Zhang, Dongsheng Li

TL;DR

This study suggests a new methodology to advance the field of RL with delay by introducing State-Action Inpainting Diffuser (SAID), a framework that integrates the inductive bias of dynamics learning with the direct decision-making capability of policy optimization.

Abstract

Signal delay poses a fundamental challenge in continuous control and reinforcement learning (RL) by introducing a temporal gap between interaction and perception. Current solutions have largely evolved along two distinct paradigms: model-free approaches which utilize state augmentation to preserve Markovian properties, and model-based methods which focus on inferring latent beliefs via dynamics modeling. In this paper, we bridge these perspectives by introducing State-Action Inpainting Diffuser (SAID), a framework that integrates the inductive bias of dynamics learning with the direct decision-making capability of policy optimization. By formulating the problem as a joint sequence inpainting task, SAID implicitly captures environmental dynamics while directly generating consistent plans, effectively operating at the intersection of model-based and model-free paradigms. Crucially, this generative formulation allows SAID to be seamlessly applied to both online and offline RL. Extensive experiments on delayed continuous control benchmarks demonstrate that SAID achieves state-of-the-art and robust performance. Our study suggests a new methodology to advance the field of RL with delay.

Paper Structure (28 sections, 1 theorem, 3 equations, 15 figures, 6 tables)


Key Result

Theorem 2.1

By integrating historical actions $a_{<t}$ into the observation, the process becomes an MDP with state transition probability $\bar{\mathcal{P}}(\bar{s}_{t+1} \mid \bar{s}_t, a_t)$, where the augmented state is $\bar{s}_t = (\tilde{s}_t, a_{t-\Delta T : t-1})$.
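
The augmented state in Theorem 2.1 can be illustrated with a short Python sketch. It assumes that $\tilde{s}_t$ is the delayed observation available to the agent at time $t$, that states and actions are flat sequences of floats, and that the action history is zero-padded at episode start. The class name `AugmentedStateBuffer` and its methods are hypothetical and are not taken from the paper.

```python
from collections import deque
from typing import Deque, Sequence, Tuple


class AugmentedStateBuffer:
    """Builds s_bar_t = (s_tilde_t, a_{t-delay : t-1}) from a delayed observation.

    Hypothetical helper for illustration only; not the authors' code.
    """

    def __init__(self, delay: int, action_dim: int) -> None:
        if delay < 0:
            raise ValueError("delay must be non-negative")
        self.delay = delay
        # Assumption: before any action has been taken, the history is
        # filled with zero actions.
        zero_action = tuple(0.0 for _ in range(action_dim))
        self.actions: Deque[Tuple[float, ...]] = deque(
            [zero_action] * delay, maxlen=delay
        )

    def record(self, action: Sequence[float]) -> None:
        """Store the action just sent to the environment (oldest one is dropped)."""
        self.actions.append(tuple(action))

    def augment(self, delayed_obs: Sequence[float]) -> Tuple[float, ...]:
        """Concatenate the delayed observation with the last `delay` actions."""
        history = tuple(x for action in self.actions for x in action)
        return tuple(delayed_obs) + history


buf = AugmentedStateBuffer(delay=2, action_dim=1)
start_state = buf.augment([0.5])   # history is still zero padding
for a in ([1.0], [2.0], [3.0]):
    buf.record(a)
later_state = buf.augment([0.7])   # only the two most recent actions remain
```

The length of the augmented state grows linearly with the delay, which is one reason a policy acting on it directly can become harder to train at large delays.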

Figures (15)

  • Figure 1: Illustration of our idea, using delay=3 as an example. Top: inpainting in images. Bottom: state-action sequence inpainting for decision making with signal delay (our approach).
  • Figure 2: The state-action sequence (in the dashed rectangle) modeled by our State-Action Inpainting Diffuser at environment timestep $t$ with delay $\Delta t$. Black indicates input conditions and red indicates inpainted values.
  • Figure 3: Handling episode start by padding initial actions and states, using $t=0$ and $t=1$ as examples. The symbols are the same as those in Figure \ref{fig:method}.
  • Figure 4: The "compounding error" phenomenon of autoregressive state-transition models. See Fig. \ref{fig:compound} for more examples.
  • Figure 5: Performance (mean episodic return) for online RL with delay. Note that the result for Hopper at delay=16 is higher than at delay=8; this is also reported in Table 10 of the reference paper (directlyforecastingbelief2025).
  • ...and 10 more figures
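
The split between input conditions and inpainted values described in the Figure 2 caption can be sketched in Python as below. This is a sketch under stated assumptions, not the authors' implementation: it assumes the window starts at the delayed state, that only this first state and the `delay` already-executed actions are known, and that known entries are overwritten after a denoising step, which is one common way to do inpainting with diffusion models. The function names and the `horizon` parameter are hypothetical.

```python
from typing import List, Optional, Tuple

Mask = List[Tuple[bool, bool]]  # (state_known, action_known) per window position


def build_condition_mask(delay: int, horizon: int) -> Mask:
    """Mark which entries of the state-action window are given as conditions.

    Assumption: position 0 holds the delayed state; positions 0..delay-1 hold
    actions that were already executed; everything else must be inpainted.
    """
    if delay < 0 or horizon < 1:
        raise ValueError("need delay >= 0 and horizon >= 1")
    return [(k == 0, k < delay) for k in range(delay + horizon)]


def apply_conditions(
    sample: List[Tuple[float, float]],
    known: List[Tuple[Optional[float], Optional[float]]],
    mask: Mask,
) -> List[Tuple[float, float]]:
    """Overwrite the known entries of a (partially denoised) sample."""
    out = []
    for (s, a), (ks, ka), (s_known, a_known) in zip(sample, known, mask):
        out.append((ks if s_known else s, ka if a_known else a))
    return out


mask = build_condition_mask(delay=3, horizon=2)
noisy = [(9.0, 9.0)] * 5                       # stand-in for a denoiser output
known = [(0.1, 1.0), (None, 2.0), (None, 3.0), (None, None), (None, None)]
clamped = apply_conditions(noisy, known, mask)
```

With scalar states and actions as placeholders, the first three positions keep their executed actions and the first position keeps the delayed state, while all remaining entries are left for the generative model to fill in.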

Theorems & Definitions (1)

  • Theorem 2.1: Recovering Markovian Property by State Augmentation