Behavioral Entropy-Guided Dataset Generation for Offline Reinforcement Learning
Wesley A. Suttle, Aamodh Suresh, Carlos Nieto-Granda
TL;DR
This work extends behavioral entropy (BE) to continuous spaces by defining a differential BE objective $H^{B,\alpha,\beta}$ via Prelec probability weighting, and develops $k$-nearest neighbor ($k$-NN) estimators of BE with convergence guarantees in high-dimensional settings. It then derives a practical BE-based reinforcement learning (RL) objective and a stable reward function for maximizing BE, enabling BE-driven data generation for offline RL. Empirical results in MuJoCo environments show that BE-generated datasets yield superior offline RL performance across multiple downstream tasks and algorithms, and that BE is more stable than Rényi-based methods. BE thus provides a principled, flexible exploration objective for generating diverse, informative datasets for offline RL, with improved data and sample efficiency and broader state-space coverage.
Abstract
Entropy-based objectives are widely used to perform state space exploration in reinforcement learning (RL) and dataset generation for offline RL. Behavioral entropy (BE), a rigorous generalization of classical entropies that incorporates cognitive and perceptual biases of agents, was recently proposed for discrete settings and shown to be a promising metric for robotic exploration problems. In this work, we propose using BE as a principled exploration objective for systematically generating datasets that provide diverse state space coverage in complex, continuous, potentially high-dimensional domains. To achieve this, we extend the notion of BE to continuous settings, derive tractable $k$-nearest neighbor estimators, provide theoretical guarantees for these estimators, and develop practical reward functions that can be used with standard RL methods to learn BE-maximizing policies. Using standard MuJoCo environments, we experimentally compare the performance of offline RL algorithms for a variety of downstream tasks on datasets generated using BE, Rényi, and Shannon entropy-maximizing policies, as well as the SMM and RND algorithms. We find that offline RL algorithms trained on datasets collected using BE outperform those trained on datasets collected using Shannon entropy, SMM, and RND on all tasks considered, and outperform those trained on datasets collected using Rényi entropy on 80% of the tasks.
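
To make the role of Prelec weighting concrete, the following is a minimal sketch of BE in the discrete setting only. It assumes the standard Prelec form $w(p) = \exp(-\beta(-\log p)^{\alpha})$ and assumes discrete BE is defined as $-\sum_i w(p_i) \log w(p_i)$; neither formula is stated in the text above, and the function names `prelec_weight` and `behavioral_entropy` are hypothetical. It does not reproduce the continuous $k$-nearest neighbor estimators or the reward function developed in this work.

```python
import numpy as np


def prelec_weight(p, alpha=1.0, beta=1.0):
    """Prelec weighting w(p) = exp(-beta * (-log p)**alpha), for p in (0, 1].

    Assumed form; alpha and beta are the shape parameters (both > 0).
    """
    p = np.asarray(p, dtype=float)
    return np.exp(-beta * (-np.log(p)) ** alpha)


def behavioral_entropy(p, alpha=1.0, beta=1.0):
    """Discrete BE under the assumed definition -sum_i w(p_i) * log w(p_i)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # zero-probability outcomes contribute nothing
    w = prelec_weight(p, alpha, beta)
    # Since log w(p) = -beta * (-log p)**alpha, use the closed form directly
    # instead of taking log(w), which avoids log(0) when w underflows.
    neg_log_w = beta * (-np.log(p)) ** alpha
    return float(np.sum(w * neg_log_w))


if __name__ == "__main__":
    uniform = np.full(4, 0.25)
    # With alpha = beta = 1 the weighting is the identity, so this is
    # the Shannon entropy of the distribution.
    print(behavioral_entropy(uniform, alpha=1.0, beta=1.0))
    # Other alpha values reweight probabilities before the entropy is taken.
    print(behavioral_entropy(uniform, alpha=0.5, beta=1.0))
```

Under these assumptions, setting $\alpha = \beta = 1$ gives $w(p) = p$, so the sketch recovers Shannon entropy as a special case, while other values of $\alpha$ and $\beta$ change how strongly low-probability outcomes are emphasized.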
