Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration

Dongyoung Kim, Jinwoo Shin, Pieter Abbeel, Younggyo Seo

TL;DR

The paper addresses exploration in reinforcement learning in the supervised setup, where a task reward is available. In this setup, state-entropy maximization suffers from an imbalance between the distributions of high-value and low-value states, and the problem is most severe when high-value states are narrowly distributed. The paper introduces value-conditional state entropy (VCSE), which estimates the state entropy conditioned on the value estimate of each state using the Kraskov–Stögbauer–Grassberger (KSG) estimator, so that only visited states with similar value estimates contribute to the intrinsic reward $r^{\text{VCSE}}_t$. VCSE is combined with standard RL algorithms by training the agent to maximize the total reward $r^T_t = r^e_t + \beta \cdot r^{\text{VCSE}}_t$, where $r^e_t$ is the extrinsic task reward and $\beta$ scales the intrinsic reward; value normalization and estimation from replay-buffer samples are used to keep learning stable. Across MiniGrid, DeepMind Control Suite, and Meta-World, VCSE accelerates learning and improves sample efficiency compared to the state-entropy (SE) baseline, often solving tasks that SE fails to solve. The work reports strong empirical gains and presents VCSE as a robust, parameter-tolerant alternative to state-entropy exploration in supervised RL settings.
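As a small illustration of the total-reward formula above, the following Python sketch combines per-step extrinsic and intrinsic rewards. The function name `total_reward` and the list-based interface are assumptions made for this example, not part of the paper's code, and the value normalization mentioned above is deliberately left out because its exact form is not given here.

```python
def total_reward(extrinsic, intrinsic, beta):
    """Combine per-step rewards as r^T_t = r^e_t + beta * r^VCSE_t.

    `extrinsic` and `intrinsic` are equal-length sequences of floats;
    `beta` scales the intrinsic (exploration) reward.
    Hypothetical helper for illustration only.
    """
    if len(extrinsic) != len(intrinsic):
        raise ValueError("extrinsic and intrinsic rewards must have equal length")
    return [r_e + beta * r_i for r_e, r_i in zip(extrinsic, intrinsic)]


# Example: two time steps with beta = 0.1
rewards = total_reward([1.0, 0.0], [0.5, 2.0], beta=0.1)
```

With a small $\beta$, the task reward dominates and the intrinsic term acts as a mild exploration bonus; a larger $\beta$ shifts the agent toward exploration.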

Abstract

A promising technique for exploration is to maximize the entropy of visited state distribution, i.e., state entropy, by encouraging uniform coverage of visited state space. While it has been effective for an unsupervised setup, it tends to struggle in a supervised setup with a task reward, where an agent prefers to visit high-value states to exploit the task reward. Such a preference can cause an imbalance between the distributions of high-value states and low-value states, which biases exploration towards low-value state regions as a result of the state entropy increasing when the distribution becomes more uniform. This issue is exacerbated when high-value states are narrowly distributed within the state space, making it difficult for the agent to complete the tasks. In this paper, we present a novel exploration technique that maximizes the value-conditional state entropy, which separately estimates the state entropies that are conditioned on the value estimates of each state, then maximizes their average. By only considering the visited states with similar value estimates for computing the intrinsic bonus, our method prevents the distribution of low-value states from affecting exploration around high-value states, and vice versa. We demonstrate that the proposed alternative to the state entropy baseline significantly accelerates various reinforcement learning algorithms across a variety of tasks within MiniGrid, DeepMind Control Suite, and Meta-World benchmarks. Source code is available at https://sites.google.com/view/rl-vcse.
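The phrase "separately estimates the state entropies that are conditioned on the value estimates of each state, then maximizes their average" corresponds to the standard definition of conditional entropy. The block below states that definition; the symbols $S$ (visited state), $V$ (value estimate), and $p$ (their densities) are notation assumed here for illustration and may differ from the paper's own.

```latex
\mathcal{H}(S \mid V)
  = \mathbb{E}_{v \sim p(v)}\!\left[ \mathcal{H}(S \mid V = v) \right]
  = -\int p(v) \int p(s \mid v) \, \log p(s \mid v) \, ds \, dv
```

Because each inner entropy $\mathcal{H}(S \mid V = v)$ depends only on states with value $v$, making the low-value states more uniform does not increase the term associated with high-value states, which is the decoupling the abstract describes.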

Paper Structure (65 sections, 7 equations, 20 figures)

Figures (20)

  • Figure 1: Illustration of our method. We randomly sample states from a replay buffer and compute Euclidean distances in the state and value spaces for each pair of samples within a minibatch. We then sort the samples by the maximum of the two distances, find the $k$-th nearest neighbor among them (e.g., $k=3$ in the figure), and use the distance to it as the intrinsic reward. As a result, samples whose values differ significantly are excluded when computing the intrinsic reward. We then train the RL agent to maximize the sum of the intrinsic reward and the extrinsic reward.
  • Figure 2: Learning curves on six navigation tasks from MiniGrid [gym_minigrid], measured by success rate. The solid line and shaded regions represent the interquartile mean and standard deviation, respectively, across 16 runs.
  • Figure 3: Examples of tasks from MiniGrid, DeepMind Control Suite, and Meta-World.
  • Figure 4: Learning curves on SimpleCrossingS9N1, measured by success rate. The solid line and shaded regions represent the interquartile mean and standard deviation, respectively, across eight runs.
  • Figure 5: Learning curves on six control tasks from DeepMind Control Suite [tassa2020dm_control], measured by episode return. The solid line and shaded regions represent the interquartile mean and standard deviation, respectively, across 16 runs.
  • ...and 15 more figures
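
The procedure described in the Figure 1 caption can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `knn_bonus_max_norm` is hypothetical, values are treated as scalars, the search is brute force, and the bonus is the raw $k$-th nearest-neighbor distance exactly as the caption words it. The KSG-based reward used in the paper may include further terms, and value normalization is omitted.

```python
import math


def knn_bonus_max_norm(states, values, k=3):
    """Return, for each sample, the distance to its k-th nearest neighbor,
    where the distance between two samples is the maximum of their
    Euclidean distance in state space and their distance in value space.

    Hypothetical helper for illustration; brute-force O(n^2) search.
    """
    n = len(states)
    if len(values) != n:
        raise ValueError("states and values must have equal length")
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")

    bonuses = []
    for i in range(n):
        dists = []
        for j in range(n):
            if i == j:
                continue
            d_state = math.dist(states[i], states[j])  # Euclidean distance between states
            d_value = abs(values[i] - values[j])       # distance between scalar value estimates
            dists.append(max(d_state, d_value))        # max norm over the two spaces
        dists.sort()
        bonuses.append(dists[k - 1])                   # distance to the k-th nearest neighbor
    return bonuses


# Three samples at the same state but with different value estimates:
# the value gap alone separates them under the max norm.
same_state = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
bonus = knn_bonus_max_norm(same_state, [0.0, 0.5, 2.0], k=1)
```

Taking the maximum of the two distances means a sample that is close in state space but far in value space is still treated as a distant neighbor, which is how samples with significantly different values end up excluded from the bonus.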