Adaptive Correlation-Weighted Intrinsic Rewards for Reinforcement Learning

Viet Bac Nguyen, Phuong Thai Nguyen

TL;DR

Experimental results demonstrate that ACWI consistently improves sample efficiency and learning stability compared to fixed intrinsic reward baselines, achieving superior performance with minimal computational overhead.

Abstract

We propose ACWI (Adaptive Correlation-Weighted Intrinsic), an adaptive intrinsic reward scaling framework that dynamically balances intrinsic and extrinsic rewards to improve exploration in sparse-reward reinforcement learning. Conventional approaches rely on manually tuned scalar coefficients, which often yield unstable or suboptimal performance across tasks; ACWI instead learns a state-dependent scaling coefficient online. Specifically, ACWI introduces a lightweight Beta Network that predicts the intrinsic reward weight directly from the agent state through an encoder-based architecture. The scaling mechanism is optimized with a correlation-based objective that encourages alignment between the weighted intrinsic rewards and the discounted future extrinsic returns. This formulation enables task-adaptive exploration incentives while preserving computational efficiency and training stability. We evaluate ACWI on a suite of sparse-reward environments in MiniGrid. Experimental results demonstrate that ACWI consistently improves sample efficiency and learning stability compared to fixed intrinsic reward baselines, achieving superior performance with minimal computational overhead.
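
To make the correlation-based objective concrete, the following is a minimal sketch, not the authors' implementation. It assumes the objective is the negative Pearson correlation, computed over a single episode, between the weighted intrinsic rewards $\beta(s_t) R^{I}_{t}$ and the discounted future extrinsic returns. The function names, the `eps` stabilizer, and the per-episode batching are illustrative assumptions; the abstract does not specify these details.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Discounted future return G_t = sum_k gamma^k * r_{t+k} for one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last step backwards so each G_t reuses G_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def pearson(xs, ys, eps=1e-8):
    """Pearson correlation; eps (an assumption) guards against zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (math.sqrt(var_x * var_y) + eps)


def correlation_loss(betas, intrinsic, extrinsic, gamma=0.99):
    """Hypothetical loss: minimizing it increases the correlation between
    beta(s_t) * R^I_t and the discounted extrinsic return from step t."""
    weighted = [b * r for b, r in zip(betas, intrinsic)]
    returns = discounted_returns(extrinsic, gamma)
    return -pearson(weighted, returns)
```

In a full training loop the `betas` would come from the Beta Network and the loss would be minimized by gradient descent in an autodiff framework; the pure-Python version here only illustrates the quantity being optimized.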

Paper Structure (19 sections, 16 equations, 6 figures, 1 table, 1 algorithm)

Figures (6)

  • Figure 1: Overview of ACWI. Given state $s_t$, action $a_t$, and next state $s_{t+1}$ in environment $E$, the Beta Network produces a state-dependent scaling factor $\beta(s_t)$. The extrinsic reward $R^{E}_{t}$ and intrinsic reward $R^{I}_{t}$ are computed from the transition $(s_t, a_t, s_{t+1})$. The combined signal $R^{E}_{t} + \alpha\beta(s_t) R^{I}_{t}$ is used to update the policy $\pi$, where $\alpha$ is a global coefficient controlling the overall magnitude of the intrinsic reward relative to the extrinsic reward.
  • Figure 2: Screenshots of the five MiniGrid environments used in our experiments [chevalier2018minigrid]. From left to right: DoorKey-8x8, Empty-16x16, KeyCorridorS3R3, UnlockPickup, and RedBlueDoors-8x8. The environments differ in the exploration depth and compositional reasoning they require.
  • Figure 3: Episode returns over training steps across five MiniGrid environments [chevalier2018minigrid]. Comparison between the adaptive beta mechanism (ACWI), ICM with fixed intrinsic scaling coefficients $\beta \in \{0.1,0.2,0.5,1,2\}$, and the PPO baseline.
  • Figure 4: Evolution of the learned state-dependent adaptive $\beta$ distributions during training across three MiniGrid environments. DoorKey-8x8 and RedBlueDoors-8x8 progressively develop structured and multimodal distributions, whereas Empty-16x16 maintains a relatively narrow distribution throughout training, indicating limited adaptation under extreme reward sparsity.
  • Figure 5: Principal component projections of learned state representations colored by adaptive $\beta$. In DoorKey-8x8 and RedBlueDoors-8x8, $\beta$ aligns with task-relevant regions of the state space. In Empty-16x16, $\beta$ shows no systematic relationship with the representation geometry.
  • ...and 1 more figure
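
The combined signal described in the Figure 1 caption, $R^{E}_{t} + \alpha\beta(s_t) R^{I}_{t}$, can be sketched as below. This is an illustrative sketch only: the function name, the list-based interface, and the default value of `alpha` are assumptions, and the `betas` argument stands in for the outputs of the Beta Network.

```python
def combined_rewards(extrinsic, intrinsic, betas, alpha=1.0):
    """Per-step training signal R^E_t + alpha * beta(s_t) * R^I_t.

    extrinsic, intrinsic, and betas are equal-length sequences for one
    trajectory; alpha is the global intrinsic-reward coefficient.
    """
    if not (len(extrinsic) == len(intrinsic) == len(betas)):
        raise ValueError("all inputs must have the same length")
    return [
        r_ext + alpha * beta * r_int
        for r_ext, r_int, beta in zip(extrinsic, intrinsic, betas)
    ]


# Hypothetical two-step trajectory: the first state is given a high beta,
# the second state has its intrinsic reward switched off (beta = 0).
signal = combined_rewards([0.0, 1.0], [0.5, 0.5], [2.0, 0.0], alpha=0.1)
```

Passing the same constant for every entry of `betas` gives a state-independent weight, which is how a fixed-coefficient baseline would look under this sketch.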