Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate

Fan-Ming Luo; Zuolin Tu; Zefang Huang; Yang Yu

TL;DR

Real-world RL often operates under partial observability, which calls for memory-based approaches. The authors introduce RESeL, which stabilizes recurrent off-policy learning by applying a lower, context-encoder-specific learning rate to the RNN-based context encoder while keeping the other layers at a standard rate, mitigating gradient amplification across long sequences. Implemented within SAC with a REDQ-style eight-critic ensemble and trained on full-length trajectories, RESeL shows improved training stability and competitive or superior performance across 18 POMDP tasks and 5 MuJoCo MDPs, with ablations confirming the necessity of a distinct learning rate for the context encoder. The approach makes recurrent RL more robust across diverse POMDP settings.
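To illustrate why a small parameter change in a recurrent component can have a growing effect over a long rollout, here is a minimal sketch using a toy scalar linear recurrence. It is not the paper's model: the recurrence, the constant input, and the function names `rollout_outputs` and `output_change` are assumptions made for illustration. The two perturbation sizes mimic a unit gradient applied with step sizes of 3e-4 and 1e-5.

```python
def rollout_outputs(w, u, v, inputs):
    """Run the toy recurrence h_t = w * h_{t-1} + u * x_t and emit y_t = v * h_t."""
    h = 0.0
    outputs = []
    for x in inputs:
        h = w * h + u * x
        outputs.append(v * h)
    return outputs


def output_change(w, delta, steps, u=1.0, v=1.0):
    """Per-step absolute output change when the recurrent weight moves from w to w + delta."""
    inputs = [1.0] * steps  # constant input, chosen only to keep the toy example simple
    before = rollout_outputs(w, u, v, inputs)
    after = rollout_outputs(w + delta, u, v, inputs)
    return [abs(a - b) for a, b in zip(after, before)]


# Hypothetical setting: the same unit gradient applied with two different step sizes.
large = output_change(w=0.9, delta=3e-4, steps=200)
small = output_change(w=0.9, delta=1e-5, steps=200)

for t in (0, 1, 9, 49, 199):
    print(f"step {t + 1:3d}: change {large[t]:.2e} (delta=3e-4) vs {small[t]:.2e} (delta=1e-5)")
```

In this toy setting the output change is zero at the first step, grows with the rollout step, and ends up far larger than the parameter change itself; the smaller perturbation keeps the change small at every step.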

Abstract

Real-world decision-making tasks are usually partially observable Markov decision processes (POMDPs), where the state is not fully observable. Recent progress has demonstrated that recurrent reinforcement learning (RL), which consists of a context encoder based on recurrent neural networks (RNNs) for unobservable state prediction and a multilayer perceptron (MLP) policy for decision making, can mitigate partial observability and serve as a robust baseline for POMDP tasks. However, previous recurrent RL methods face training stability issues due to the gradient instability of RNNs. In this paper, we propose Recurrent Off-policy RL with Context-Encoder-Specific Learning Rate (RESeL) to tackle this issue. Specifically, RESeL uses a lower learning rate for the context encoder than for the other MLP layers, ensuring the stability of the former while maintaining the training efficiency of the latter. We integrate this technique into existing off-policy RL methods, resulting in the RESeL algorithm. We evaluated RESeL on 18 POMDP tasks, including classic, meta-RL, and credit assignment scenarios, as well as five MDP locomotion tasks. The experiments demonstrate significant improvements in training stability with RESeL. Comparative results show that RESeL achieves notable performance improvements over previous recurrent RL baselines in POMDP tasks, and is competitive with or even surpasses state-of-the-art methods in MDP tasks. Further ablation studies highlight the necessity of applying a distinct learning rate for the context encoder.
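The core idea of using one learning rate for the context encoder and another for the remaining layers can be sketched as a per-group update. The sketch below is a simplified stand-in, not the authors' implementation: it uses plain gradient descent on scalar parameters, and the parameter names, the `group_of` mapping, and the `sgd_step` helper are hypothetical. The two learning-rate values are the ones quoted in the Figure 4 caption. In a deep learning framework the same effect is usually obtained by giving the optimizer separate parameter groups.

```python
LR_CONTEXT_ENCODER = 1e-5  # LR_CE value quoted in the Figure 4 caption
LR_OTHER = 3e-4            # LR_other value quoted in the Figure 4 caption


def sgd_step(params, grads, group_of, lrs):
    """Return updated parameters, scaling each gradient by its group's learning rate."""
    return {
        name: value - lrs[group_of[name]] * grads[name]
        for name, value in params.items()
    }


# Hypothetical scalar parameters standing in for the RNN and MLP weights.
params = {"rnn.weight": 0.5, "mlp.weight": 0.5}
grads = {"rnn.weight": 1.0, "mlp.weight": 1.0}
group_of = {"rnn.weight": "context_encoder", "mlp.weight": "other"}
lrs = {"context_encoder": LR_CONTEXT_ENCODER, "other": LR_OTHER}

updated = sgd_step(params, grads, group_of, lrs)
for name in params:
    print(f"{name}: moved by {params[name] - updated[name]:.1e}")
```

With identical gradients, the context-encoder parameter moves 30 times less than the MLP parameter, which is the asymmetry the abstract describes.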

Paper Structure (38 sections, 2 theorems, 31 equations, 17 figures, 6 tables, 1 algorithm)

Introduction
Background
Related Work
Method
Model Architectures
Stabilizing Training with a Context-Encoder-Specific Learning Rate
Training Procedure of RESeL
Experiments
Training Stability
Performance Comparisons
Sensitivity and Ablation Studies
Conclusions and Limitations
Proof of the Proposition on Output Change over Time
Average Output Differences Bound
Algorithmic Details
...and 23 more sections

Key Result

Proposition 1

Assuming $f^\theta$ and $f^{\theta'}$ both satisfy Lipschitz continuity, i.e., for all $\hat{\theta} \in \{\theta, \theta'\}$, $x \in \mathcal{X}$, $h, h' \in \mathcal{H}$, there exist constants $K_h \in [0, 1)$ and $K_y \in \mathbb{R}$ such that: and for all $x \in \mathcal{X}$, $h \in \mathcal{H}$, the output differences between the RNN parameterized by $\theta$ and $\theta'$ are bounded by a c

Figures (17)

Figure 1: A simple recurrent policy architecture.
Figure 2: Policy and critic architectures of RESeL.
Figure 3: Policy output variations as rollout step increases after a one-step gradient-update with different ${\rm LR}_{\rm CE}$ and ${\rm LR}_{\rm other}$.
Figure 4: Learning curves in four classic POMDP tasks with ${\rm LR}_{\rm CE}=3\times 10^{-4}$ or ${\rm LR}_{\rm CE}=1\times 10^{-5}$, shaded with one standard error. We fixed ${\rm LR}_{\rm other}=3\times 10^{-4}$. The learning curves in AntBLT-V and HalfCheetahBLT-V are incomplete as some runs encountered infinite or NaN outputs.
Figure 5: Learning curves shaded with $1$ standard error in classic POMDP tasks.
...and 12 more figures

Theorems & Definitions (4)

Proposition 1
proof
Proposition 2
proof
