Episodic Reinforcement Learning with Expanded State-reward Space

Dayang Liang; Yaru Zhang; Yunlong Liu

TL;DR

The paper addresses data inefficiency in deep reinforcement learning by augmenting episodic control with an expanded state-reward space. It reuses retrieved past states as part of the input and integrates retrieved MC-returns into the immediate reward, within a Soft Actor-Critic framework. Empirical results on Box2D and MuJoCo show superior performance and reduced Q-value overestimation compared with baselines such as EMAC, and ablations reveal that the optimal balance between past and current information is task-dependent. The approach improves sample efficiency and the reliability of value estimates, offering a practical path to more data-efficient DRL in continuous control domains.

Abstract

Empowered by deep neural networks, deep reinforcement learning (DRL) has demonstrated tremendous empirical success in various domains, including games, health care, and autonomous driving. Despite these advances, DRL is still regarded as data-inefficient, since learning effective policies demands vast numbers of environment samples. Recently, episodic control (EC)-based model-free DRL methods have improved sample efficiency by recalling past experiences from episodic memory. However, existing EC-based methods neglect the (past) retrieved states, which carry extensive information, and therefore suffer from a potential misalignment between the state and reward spaces; this can cause inaccurate value estimation and degraded policy performance. To tackle this issue, we introduce an efficient EC-based DRL framework with an expanded state-reward space, in which both the expanded states used as input and the expanded rewards used in training contain historical and current information. Specifically, we reuse the historical states retrieved by EC as part of the input states and integrate the retrieved MC-returns into the immediate reward of each interaction transition. As a result, our method simultaneously makes full use of the retrieved information and obtains a better evaluation of state values through a Temporal Difference (TD) loss. Empirical results on challenging Box2D and MuJoCo tasks demonstrate the superiority of our method over a recent sibling method and common baselines. Additional Q-value comparison experiments further verify our method's effectiveness in alleviating Q-value overestimation.
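To make the idea of an expanded state-reward space concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy memory that stores raw states with their discounted Monte Carlo returns, exact nearest-neighbour retrieval by Euclidean distance, concatenation as the way to form the expanded state, and a weighted sum with a hypothetical coefficient `beta` as the way to fold the retrieved return into the reward. The names `EpisodicMemory`, `expand_transition`, and `beta` are illustrative and do not come from the paper.

```python
import numpy as np


class EpisodicMemory:
    """Toy episodic memory storing (state, Monte Carlo return) pairs."""

    def __init__(self):
        self.states = []
        self.returns = []

    def add_episode(self, states, rewards, gamma=0.99):
        # Compute discounted Monte Carlo returns backwards through the episode.
        g = 0.0
        mc_returns = []
        for r in reversed(rewards):
            g = r + gamma * g
            mc_returns.append(g)
        mc_returns.reverse()
        for s, ret in zip(states, mc_returns):
            self.states.append(np.asarray(s, dtype=np.float64))
            self.returns.append(ret)

    def retrieve(self, state):
        # Assumption: exact nearest neighbour under Euclidean distance.
        keys = np.stack(self.states)
        query = np.asarray(state, dtype=np.float64)
        idx = int(np.argmin(np.linalg.norm(keys - query, axis=1)))
        return self.states[idx], self.returns[idx]


def expand_transition(memory, state, reward, beta=0.1):
    """Build an expanded state and reward from the current transition.

    Assumptions: the expanded state is the concatenation of the current and
    retrieved states, and the expanded reward is reward + beta * retrieved
    MC return. Both choices are illustrative.
    """
    past_state, past_return = memory.retrieve(state)
    expanded_state = np.concatenate([np.asarray(state, dtype=np.float64), past_state])
    expanded_reward = reward + beta * past_return
    return expanded_state, expanded_reward


# Usage: store one two-step episode, then expand a new transition.
memory = EpisodicMemory()
memory.add_episode(states=[[0.0, 0.0], [1.0, 0.0]], rewards=[1.0, 4.0], gamma=0.5)
expanded_state, expanded_reward = expand_transition(memory, [0.9, 0.1], 1.0, beta=0.5)
```

In this sketch the expanded pair would then replace the ordinary state and reward in the replay buffer of an off-policy learner such as Soft Actor-Critic, so that the TD loss sees both historical and current information. How the two sources are actually weighted is a design choice that this sketch does not settle.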

Paper Structure (16 sections, 15 equations, 7 figures, 2 tables)

Related Work
Background
Soft Actor Critic
Gaussian Random Projection
Episodic Control
Method
Overall Architecture
Episodic Retrieval
Optimization Implementation
Space Alignment
Experiments
Environments
Main Result
Q-value Overestimation
Ablation Study
...and 1 more sections

Figures (7)

Figure 1: Algorithm structure. The episodic control-based reinforcement learning approach with expanded state-reward space.
Figure 2: Structure of the episodic retrieval module.
Figure 3: Two match relationships between state and reward space during value back-propagation.
Figure 4: Illustrations of the experimental environments. From left to right: Pusher-v2, LunarLanderContinuous-v2, InvertedPendulum-v2, Walker2d-v3, HalfCheetah-v3, and Hopper-v3.
Figure 5: Performance comparison over 100K environment steps (20K steps in InvertedPendulum-v2) on MuJoCo and Box2D tasks. For every curve, the mean episode reward is computed every 1000 environment steps (100 steps in InvertedPendulum-v2), averaging over 10 episodes. Each curve is averaged over 10 seeds and is smoothed for visual clarity.
...and 2 more figures
