Temporal Difference Learning with Constrained Initial Representations

Jiafei Lyu; Jingwen Yang; Zhongjian Qiao; Runze Liu; Zeyuan Liu; Deheng Ye; Zongqing Lu; Xiu Li

TL;DR

This work tackles the challenge of sample-efficient off-policy reinforcement learning by constraining the initial input representations. It introduces CIR, which combines a Tanh-activated initial layer with AvgRNorm and LayerNorm, a U-shaped skip-connected critic, and convex Q-learning to stabilize training and improve value estimation. The authors provide theoretical results showing preserved linear independence, reduced gradient variance, and convergence guarantees for TD(0) with tanh-transformed features under regularization. Empirically, CIR achieves competitive or superior performance across DeepMind Control, HumanoidBench, and ODRL benchmarks while offering favorable compute efficiency, underscoring the value of architectural constraints for data-efficient RL.

Abstract

Recent work has sought to improve the sample efficiency of off-policy reinforcement learning (RL) agents through architectural improvements and new algorithms. These efforts, however, overlook the potential of directly constraining the initial representations of the input data, which can intuitively alleviate distribution shift and stabilize training. In this paper, we introduce the Tanh function into the initial layer to impose such a constraint. We theoretically analyze the convergence of temporal difference learning with the Tanh function under linear function approximation. Motivated by these theoretical insights, we present the Constrained Initial Representations framework, CIR, which comprises three components: (i) the Tanh activation together with normalization methods to stabilize representations; (ii) a skip connection module that provides a linear pathway from shallow layers to deep layers; and (iii) convex Q-learning, which allows a more flexible value estimate and mitigates potential conservatism. Empirical results show that CIR performs strongly on numerous continuous control tasks, matching or surpassing existing strong baselines.
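The setting analyzed in the abstract, TD(0) with Tanh-transformed features under linear function approximation, can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the three-state chain, the feature matrix `phi`, the choice $W = c \cdot I$ with `c=0.5`, and the L2 penalty `reg` standing in for "regularization" are all assumptions made for the example.

```python
import numpy as np


def td0_tanh_features(phi, P, rewards, c=0.5, gamma=0.9,
                      alpha=0.05, reg=1e-3, steps=5000, seed=0):
    """TD(0) policy evaluation with V(s) = theta^T tanh(c * phi(s)).

    phi:     (n_states, dim) raw features, one row per state (hypothetical).
    P:       (n_states, n_states) transition matrix of the evaluated policy.
    rewards: (n_states,) reward received when leaving each state.
    reg:     assumed L2 penalty on theta; the paper's exact regularizer
             is not given in this summary.
    """
    rng = np.random.default_rng(seed)
    feats = np.tanh(c * phi)          # W = c * I, so W @ phi(s) = c * phi(s)
    n_states, dim = feats.shape
    theta = np.zeros(dim)
    s = 0
    for _ in range(steps):
        s_next = rng.choice(n_states, p=P[s])
        # One-step TD error under the current linear value estimate.
        td_error = rewards[s] + gamma * feats[s_next] @ theta - feats[s] @ theta
        # Semi-gradient TD(0) step with the assumed L2 shrinkage term.
        theta += alpha * (td_error * feats[s] - reg * theta)
        s = s_next
    return theta, feats


# Hypothetical 3-state chain used only to exercise the update rule.
phi = np.array([[1.0, 0.2, 0.0],
                [0.0, 1.0, 0.2],
                [0.2, 0.0, 1.0]])
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
rewards = np.array([0.0, 0.0, 1.0])

theta, feats = td0_tanh_features(phi, P, rewards)
values = feats @ theta  # estimated state values
```

Because every entry of `feats` lies strictly inside $(-1, 1)$, the size of each update is bounded by the TD error times a constant, which is the intuition behind constraining the initial representation.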

Paper Structure (34 sections, 11 theorems, 58 equations, 14 figures, 18 tables, 1 algorithm)


Introduction
Related Work
Preliminaries
Theoretical Analysis
Algorithm
Constraining Initial Representation with the Tanh Activation
Boosting Gradient Flow with Skip Connections
Convex Q-learning
Experiments
Main Results
Extended Results on Complex Tasks
Ablation Study
Scaling Results
Conclusion and Future Opportunities
Missing Proofs
...and 19 more sections

Key Result

Theorem 4.2

Suppose that $W=c\cdot I$, where $c\in\mathbb{R}$ is a sufficiently small positive number and $I$ is the identity matrix. If the basis functions $\{\phi_1,\ldots,\phi_S\}$ are linearly independent, then the transformed basis functions $\{\tanh(W\phi_1),\ldots,\tanh(W\phi_S)\}$ are also linearly independent.
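A quick numerical sanity check of this statement, offered as an illustration rather than a proof, is to compare matrix ranks before and after the Tanh transform. The two basis vectors below and the values of `c` are hypothetical. The large-`c` case is included only to suggest why the "sufficiently small" condition matters: in floating point, both transformed vectors saturate to all ones and become linearly dependent.

```python
import numpy as np


def rank_after_tanh(phi, c):
    """Rank of the basis after applying tanh(W phi) with W = c * I.

    phi holds one basis vector per column, so full column rank
    corresponds to linear independence of the basis.
    """
    return int(np.linalg.matrix_rank(np.tanh(c * phi)))


# Two linearly independent basis vectors (columns), chosen for illustration.
phi = np.array([[1.0, 2.0],
                [2.0, 1.0]])

rank_original = int(np.linalg.matrix_rank(phi))
rank_small_c = rank_after_tanh(phi, c=1e-2)   # tanh is nearly linear here
rank_large_c = rank_after_tanh(phi, c=50.0)   # tanh saturates to 1.0
```

For small `c`, $\tanh(c\,x) \approx c\,x$, so the transformed matrix is close to a rescaled copy of the original and keeps its rank.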

Figures (14)

Figure 1: Illustration of our motivation. When encountering an out-of-distribution (OOD) sample that deviates far from the distribution of the current policy, the vanilla MLP network may suffer a severe distribution shift (left), whereas adding the Tanh activation to the MLP mitigates the negative influence of the OOD sample (right).
Figure 2: Architecture overview of CIR. CIR adopts (a) the AvgRNorm module and the Tanh activation to stabilize initial representations; (b) skip connection modules to facilitate gradient flow.
Figure 3: Visualizations of the benchmarks. We consider tasks from the DMC suite, HumanoidBench and ODRL for evaluations. These tasks feature varying complexity and can be challenging.
Figure 4: Comparison of CIR against baselines. The average episode return is compared on DMC tasks while normalized return results are compared in HumanoidBench and ODRL tasks.
Figure 5: Left: Ablation study on network components and algorithmic components in CIR. Right: Ablation study on Tanh and skip connection module in CIR. SC denotes skip connection.
...and 9 more figures
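The motivation in Figure 1 can be illustrated with a small sketch. It assumes a single randomly initialized linear layer as the "initial layer" and a hypothetical OOD input obtained by scaling an in-distribution input by 100; it omits the AvgRNorm and LayerNorm modules shown in Figure 2, whose definitions are not given in this summary.

```python
import numpy as np


def initial_representation(x, weight, use_tanh):
    """First-layer output, optionally squashed by Tanh."""
    h = weight @ x
    return np.tanh(h) if use_tanh else h


rng = np.random.default_rng(0)
weight = rng.normal(size=(8, 4))   # hypothetical 4 -> 8 initial layer
x_in = rng.normal(size=4)          # stand-in for an in-distribution input
x_ood = 100.0 * x_in               # stand-in for a far-away OOD input

# Without Tanh the representation norm grows linearly with the input scale.
norm_plain = np.linalg.norm(initial_representation(x_ood, weight, False))
# With Tanh every coordinate lies in [-1, 1], so the norm is at most sqrt(8).
norm_tanh = np.linalg.norm(initial_representation(x_ood, weight, True))
```

The bound on `norm_tanh` holds for any input, which is why later layers see a representation of limited magnitude even when the raw input is far from the training distribution.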

Theorems & Definitions (16)

Theorem 4.2: Linear Independence
Theorem 4.3: Variance Reduction
Theorem 4.4: Convergence under TD(0)
Theorem 4.5: Global Convergence under Regularization
Theorem 5.1
Theorem A.1: Linear Independence
proof
Theorem A.2: Variance Reduction
proof
Theorem A.3: Convergence under TD(0)
...and 6 more
