
Delayed Homomorphic Reinforcement Learning for Environments with Delayed Feedback

Jongsoo Lee, Jangwon Kim, Soohee Han

Abstract

Reinforcement learning in real-world systems is often accompanied by delayed feedback, which breaks the Markov assumption and impedes both learning and control. Canonical state augmentation approaches cause state-space explosion, which introduces a severe sample-complexity burden. Despite recent progress, state-of-the-art augmentation-based baselines remain incomplete: they either predominantly reduce the burden on the critic or adopt non-unified treatments of the actor and critic. To provide a structured and sample-efficient solution, we propose delayed homomorphic reinforcement learning (DHRL), a framework grounded in MDP homomorphisms that collapses belief-equivalent augmented states and enables efficient policy learning on the resulting abstract MDP without loss of optimality. We provide theoretical analyses of state-space compression bounds and sample complexity, and introduce a practical algorithm. Experiments on continuous control tasks from the MuJoCo benchmark confirm that our algorithm outperforms strong augmentation-based baselines, particularly under long delays.
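To make the "canonical state augmentation" that the abstract critiques concrete, here is a minimal generic sketch (not the paper's DHRL algorithm) of the standard construction for a constant observation delay $\Delta$: the agent acts on the last observed state together with the $\Delta$ actions taken since that observation, $x_t = (s_{t-\Delta}, a_{t-\Delta}, \dots, a_{t-1})$. The class name and method names are illustrative assumptions, not from the paper.

```python
from collections import deque


class DelayAugmentedState:
    """Canonical state augmentation for a constant observation delay.

    The augmented state is (last observed state, buffer of the last
    `delay` actions). The augmented process is Markov again, but the
    augmented state space has size |S| * |A|^delay, i.e. it grows
    exponentially with the delay -- the explosion DHRL aims to avoid
    by collapsing belief-equivalent augmented states.
    """

    def __init__(self, initial_state, delay):
        self.delay = delay
        self.last_observed = initial_state
        # Keeps only the most recent `delay` actions.
        self.action_buffer = deque(maxlen=delay)

    def step(self, action, newly_observed_state=None):
        """Record the action taken; consume a delayed observation if one arrived."""
        self.action_buffer.append(action)
        if newly_observed_state is not None:
            self.last_observed = newly_observed_state

    def augmented(self):
        """Return the augmented state x_t = (s_{t-delay}, a_{t-delay}, ..., a_{t-1})."""
        return (self.last_observed, tuple(self.action_buffer))
```

For example, with `delay=2`, after two actions the agent's state is the initial observation plus both pending actions; once the observation of a later state arrives, the buffer slides forward by one.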

Paper Structure

This paper contains 38 sections, 12 theorems, 63 equations, 10 figures, 4 tables, 1 algorithm.

Key Result

Proposition 3.3

A partition $B_\Delta$ of $\mathcal{M}_\Delta$ induced by the belief-equivalence relation is a reward-respecting SSP partition. $\blacktriangleleft$

Figures (10)

  • Figure 1: Normalized number of Bellman backups of value iteration on the regular MDP (naive VI) and the abstract MDP (DHVI) with different delays $\Delta = \{2, 4, 6\}$. Naive VI is used as the baseline (normalized to $1.0$), and the numbers in parentheses indicate the actual number of Bellman backups until convergence.
  • Figure 2: Normalized performance (average returns) of augmented SAC and D$^2$HPG-naive with different delays on the HalfCheetah-v3 MuJoCo task, where D$^2$HPG-naive is used as the baseline (normalized to $1.0$). Each algorithm was evaluated for one million time steps with 5 random seeds.
  • Figure 3: A schematic overview of D$^2$HPG, where we assume the homomorphic image of $\mathcal{M}_\Delta$ corresponds to $\mathcal{M}$.
  • Figure 4: Visual illustration of continuous control tasks in the MuJoCo benchmark: (a) Ant-v3, (b) HalfCheetah-v3, (c) Walker2d-v3, (d) Hopper-v3, (e) Humanoid-v3, and (f) InvertedPendulum-v2.
  • Figure 5: Performance curves of each algorithm on the MuJoCo benchmarks with $\Delta = 5$.
  • ...and 5 more figures

Theorems & Definitions (29)

  • Definition 2.1
  • Definition 3.1
  • Definition 3.2: belief-equivalence
  • Proposition 3.3
  • Proof
  • Corollary 3.4: Preservation of optimality
  • Proof sketch
  • Proposition 3.5
  • Proof
  • Corollary 3.6
  • ...and 19 more