Behavioral Entropy-Guided Dataset Generation for Offline Reinforcement Learning

Wesley A. Suttle, Aamodh Suresh, Carlos Nieto-Granda

TL;DR

This work extends Behavioral Entropy (BE) to continuous spaces by defining a differential BE objective $H^{B,\alpha,\beta}$ via Prelec probability weighting and develops $k$-NN estimators with convergence guarantees for BE in high-dimensional settings. It then derives a practical BE-based reinforcement learning objective and a stable reward function to maximize BE, enabling BE-driven data generation for offline RL. Empirical results in MuJoCo environments show BE-generated datasets yield superior offline RL performance across multiple downstream tasks and algorithms, while exhibiting greater stability than Rényi-based methods. Overall, BE provides a principled and flexible exploration objective for generating diverse, informative datasets for offline RL, with notable improvements in data- and sample-efficiency and broader state-space coverage.
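To make the Prelec weighting mentioned above concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the standard Prelec form $w(p) = \exp(-\beta(-\log p)^{\alpha})$ and, for the entropy, the discrete form $-\sum_i w(p_i)\log w(p_i)$. The function names `prelec_weight` and `behavioral_entropy` are hypothetical, and the differential (continuous) case developed in the paper is not covered.

```python
import math


def prelec_weight(p: float, alpha: float, beta: float = 1.0) -> float:
    """Prelec weighting w(p) = exp(-beta * (-log p)**alpha) for p in (0, 1].

    Assumed form; alpha > 0 and beta > 0 are the shape parameters.
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("p must lie in (0, 1]")
    # abs() avoids a negative zero when p == 1.
    return math.exp(-beta * abs(math.log(p)) ** alpha)


def behavioral_entropy(probs, alpha: float, beta: float = 1.0) -> float:
    """Discrete BE sketch: -sum_i w(p_i) * log w(p_i) (assumed form)."""
    total = 0.0
    for p in probs:
        if p == 0.0:
            continue  # zero-probability outcomes contribute nothing
        w = prelec_weight(p, alpha, beta)
        total -= w * math.log(w)
    return total


if __name__ == "__main__":
    uniform = [0.25] * 4
    for alpha in (0.5, 1.0, 2.0):
        print(alpha, behavioral_entropy(uniform, alpha))
```

With $\alpha = \beta = 1$ the weighting is the identity, so the sketch reduces to Shannon entropy. Values $\alpha < 1$ overweight small probabilities, which is the kind of perceptual bias the weighting is meant to model.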

Abstract

Entropy-based objectives are widely used to perform state space exploration in reinforcement learning (RL) and dataset generation for offline RL. Behavioral entropy (BE), a rigorous generalization of classical entropies that incorporates cognitive and perceptual biases of agents, was recently proposed for discrete settings and shown to be a promising metric for robotic exploration problems. In this work, we propose using BE as a principled exploration objective for systematically generating datasets that provide diverse state space coverage in complex, continuous, potentially high-dimensional domains. To achieve this, we extend the notion of BE to continuous settings, derive tractable $k$-nearest neighbor estimators, provide theoretical guarantees for these estimators, and develop practical reward functions that can be used with standard RL methods to learn BE-maximizing policies. Using standard MuJoCo environments, we experimentally compare the performance of offline RL algorithms for a variety of downstream tasks on datasets generated using BE, Rényi, and Shannon entropy-maximizing policies, as well as the SMM and RND algorithms. We find that offline RL algorithms trained on datasets collected using BE outperform those trained on datasets collected using Shannon entropy, SMM, and RND on all tasks considered, and on 80% of the tasks compared to datasets collected using Rényi entropy.

Paper Structure

This paper contains 13 sections, 5 theorems, 28 equations, 13 figures, 3 tables.

Key Result

Theorem 1

Suppose that $k := k_n \rightarrow \infty, \frac{k_n}{n} \rightarrow 0$, and $\frac{k_n}{\log n} \rightarrow \infty$ as $n \rightarrow \infty$. Assume that $w$ is Lipschitz, that $f$ is absolutely continuous, and that there exist $c_1, c_2 > 0$ such that $0 < c_1 \leq f(x) \leq c_2 < \infty$, for al

Figures (13)

  • Figure 1: (Left) Comparison of Shannon entropy, Rényi entropy, and behavioral entropy (ours) and their effects on dataset generation, shown in PHATE plots, when used as an exploration objective. (Right) Performance comparison of an offline RL algorithm (CQL) for three downstream tasks on datasets generated using Shannon, behavioral entropy (ours), and Rényi entropy for the parameter $q = 1.1$ shown in the left-hand figure.
  • Figure 2: Visualizations of probability weightings (left) and superior expressiveness of BE (right).
  • Figure 3: PHATE plots for Walker tasks.
  • Figure 4: Comparison of offline RL performance over the entropy objectives used in dataset generation. Plots show mean and standard deviation over five seeds. Dotted line shows performance of RL policy trained online until approximate optimality.
  • Figure 5: Offline RL results for all $\alpha$ and $q$ values evaluated. Initial trials showed $q \in \{2.0, 3.0, 5.0\}$ led to performance no better (and usually worse) than $q = 1.1$, so offline RL training for these $q$ values was not performed.
  • ...and 8 more figures

Theorems & Definitions (9)

  • Definition 1
  • Definition 2
  • Definition 3
  • Theorem 1
  • Theorem 2
  • Lemma 1 [singh2016finite]
  • Corollary 1
  • Lemma 2 [zhao2022analysis]
  • proof