Constrained Intrinsic Motivation for Reinforcement Learning

Xiang Zheng; Xingjun Ma; Chao Shen; Cong Wang

TL;DR

This work tackles two core problems in intrinsic motivation (IM) for RL: designing an effective intrinsic objective for Reward-Free Pre-Training (RFPT), and mitigating the bias that intrinsic rewards introduce in Exploration with Intrinsic Motivation (EIM), where task rewards are present. It introduces Constrained Intrinsic Motivation (CIM). For RFPT, CIM maximizes a lower bound on the conditional state entropy $H(\phi(\mathbf{s})|\mathbf{z})$ under an alignment constraint on the state encoder; for EIM, it uses a Lagrangian-based adaptive coefficient to adjust the weight of the intrinsic objective during policy optimization. On MuJoCo robotics tasks, the approach improves skill diversity, state coverage, and fine-tuning efficiency over prior IM methods, and it reduces intrinsic-bias effects when task rewards are available from the outset. The authors also provide theoretical constructs (alignment loss, lower-bounding mutual information) and practical estimation methods (xi-nearest neighbor) to support scalable implementation, with code available at the project URL.

Abstract

This paper investigates two fundamental problems that arise when utilizing Intrinsic Motivation (IM) for reinforcement learning in Reward-Free Pre-Training (RFPT) tasks and Exploration with Intrinsic Motivation (EIM) tasks: 1) how to design an effective intrinsic objective in RFPT tasks, and 2) how to reduce the bias introduced by the intrinsic objective in EIM tasks. Existing IM methods suffer from static skills, limited state coverage, sample inefficiency in RFPT tasks, and suboptimality in EIM tasks. To tackle these problems, we propose Constrained Intrinsic Motivation (CIM) for RFPT and EIM tasks, respectively: 1) CIM for RFPT maximizes the lower bound of the conditional state entropy subject to an alignment constraint on the state encoder network for efficient dynamic and diverse skill discovery and state coverage maximization; 2) CIM for EIM leverages constrained policy optimization to adaptively adjust the coefficient of the intrinsic objective to mitigate the distraction from the intrinsic objective. In various MuJoCo robotics environments, we empirically show that CIM for RFPT greatly surpasses fifteen IM methods for unsupervised skill discovery in terms of skill diversity, state coverage, and fine-tuning performance. Additionally, we showcase the effectiveness of CIM for EIM in redeeming intrinsic rewards when task rewards are exposed from the beginning. Our code is available at https://github.com/x-zheng16/CIM.
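
State-entropy objectives of the kind described above are often estimated with nearest-neighbor (particle-based) methods. The following is a minimal sketch of a generic k-nearest-neighbor entropy-style reward, not the paper's estimator: the function name `knn_entropy_reward`, the parameter `k`, and the `log(1 + distance)` form are assumptions chosen for illustration, and the sketch ignores the skill conditioning and the alignment constraint.

```python
import numpy as np


def knn_entropy_reward(embeddings, k=3):
    """Hypothetical helper: per-sample reward log(1 + d_k), where d_k is the
    Euclidean distance from each embedding to its k-th nearest neighbor
    within the same batch. Larger values indicate sparser regions."""
    x = np.asarray(embeddings, dtype=float)
    n = x.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")
    # Pairwise Euclidean distances, shape (n, n).
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # After sorting each row, column 0 is the self-distance (zero),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dists, axis=1)[:, k]
    return np.log1p(kth)


if __name__ == "__main__":
    # Three clustered points and one isolated point (made-up data).
    batch = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, 2.0], [10.0, 10.0]])
    print(knn_entropy_reward(batch, k=1))
```

In this toy batch the isolated point receives the largest reward, which is the behavior a coverage-seeking intrinsic reward is meant to have. The pairwise-distance matrix is quadratic in the batch size, so this form is only suitable for small batches.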

Paper Structure (26 sections, 12 equations, 3 figures, 6 tables)

Introduction
Preliminaries
Markov Decision Processes
Reward-Free Pre-Training and Exploration
Intrinsic Motivation Methods
Constrained Intrinsic Motivation
Constrained Intrinsic Motivation for RFPT
Problems of Previous Intrinsic Motivation Methods
Intrinsic objective.
Alignment constraint.
Design of Constrained Intrinsic Objective
Estimation of Conditional State Entropy
Lower bound of conditional state entropy.
Intrinsic reward.
Constrained Intrinsic Motivation for EIM
...and 11 more sections

Figures (3)

Figure 1: Visualization of 2D continuous locomotion skills in Ant. Each color of the trajectories in competence-based IM methods (in blue) represents the direction of the latent skill variable $z$.
Figure 2: Visualization of 2D continuous manipulation skills discovered by various IM methods in FetchSlide. Each color of the trajectories in competence-based IM methods (in blue) represents the direction of the latent skill variable $z$.
Figure 3: (a) Discrete CIM with $n_z=8$ in Ant. (b) Trajectory visualization of the meta-controller, where the color of each sub-trajectory reflects the direction of the skill. (c) Learning curves using different coefficients of the intrinsic objective.
