Counterfactual experience augmented off-policy reinforcement learning
Sunbowen Lee, Yicheng Gong, Chao Deng
TL;DR
This paper tackles the out-of-distribution (OOD) problem and inefficient exploration in off-policy RL by introducing Counterfactual Experience Augmentation (CEA), which combines a State Transition Autoencoder (STA) with counterfactual action sampling to synthetically augment experience while preserving reward signals through bisimulation-based matching. STA uses a conditional variational autoencoder (CVAE) to model stochastic state transitions via a latent variable, enabling counterfactual inference conditioned on untried actions. Counterfactual actions are generated using maximum-entropy Gaussian kernel density estimation and propagated through the STA to form complete counterfactual experiences. These virtual experiences are paired with real samples via a closest-transition approach to assign rewards, then added to the replay pool with prioritized sampling to improve data efficiency. The method is demonstrated on SUMO and Highway (discrete actions) and on Pendulum and Lunar Lander (continuous actions), showing strong or competitive performance and offering a pathway to alleviating the OOD problem in diverse RL tasks. Limitations include computational cost and the need for refined bisimulation assumptions; proposed future work covers adaptive augmentation and broader applicability to continuous action spaces.
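The closest-transition reward assignment can be illustrated with a minimal sketch. It assumes that each transition is represented by the concatenation of state, action, and next state, and that "closest" means smallest Euclidean distance; both choices, along with the function name `assign_rewards_by_nearest_transition`, are hypothetical and may differ from the matching rule used in the paper.

```python
import numpy as np

def assign_rewards_by_nearest_transition(real_s, real_a, real_s2, real_r,
                                         cf_s, cf_a, cf_s2):
    """Give each counterfactual transition the reward of its nearest real one.

    All state/action arrays are 2-D (n_samples, dim); real_r is 1-D.
    Assumption: distance is Euclidean over the concatenated (s, a, s') vector.
    """
    real_feats = np.concatenate([real_s, real_a, real_s2], axis=1)
    cf_feats = np.concatenate([cf_s, cf_a, cf_s2], axis=1)
    # Pairwise squared distances, shape (n_counterfactual, n_real).
    diff = cf_feats[:, None, :] - real_feats[None, :, :]
    dist2 = np.sum(diff ** 2, axis=2)
    # Index of the closest real transition for every counterfactual one.
    nearest = np.argmin(dist2, axis=1)
    return real_r[nearest], nearest

# Two real transitions with known rewards.
real_s = np.array([[0.0], [1.0]])
real_a = np.array([[0.0], [1.0]])
real_s2 = np.array([[0.0], [1.0]])
real_r = np.array([1.0, -1.0])

# Two counterfactual transitions (for example, produced by a transition model).
cf_s = np.array([[0.1], [0.9]])
cf_a = np.array([[0.0], [1.0]])
cf_s2 = np.array([[0.0], [1.1]])

cf_r, matched = assign_rewards_by_nearest_transition(
    real_s, real_a, real_s2, real_r, cf_s, cf_a, cf_s2)
```

In this toy case the first counterfactual transition is matched to the first real transition and the second to the second, so the borrowed rewards are 1.0 and -1.0. For discrete actions, a one-hot action encoding would be one way to make the distance meaningful.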
Abstract
Reinforcement learning control algorithms face significant challenges from the out-of-distribution problem and inefficient exploration. Model-based reinforcement learning enhances the agent's reasoning and planning capabilities by constructing virtual environments, but training such virtual environments can be very complex. To build an efficient inference model and make the learning data more representative, we propose the Counterfactual Experience Augmentation (CEA) algorithm. CEA leverages variational autoencoders to model the dynamic patterns of state transitions and introduces randomness to model non-stationarity. The approach expands the learning data in the experience pool through counterfactual inference and performs exceptionally well in environments that follow the bisimulation assumption. Because environments with bisimulation properties are usually represented by discrete observation and action spaces, we propose a sampling method based on maximum kernel density estimation entropy to extend CEA to various environments. By providing reward signals for counterfactual state transitions based on real information, CEA constructs complete counterfactual experiences to alleviate the out-of-distribution problem of the learning data, and it outperforms general SOTA algorithms in environments with different properties. Finally, we discuss the similarities, differences, and properties of generated counterfactual experiences and real experiences. The code is available at https://github.com/Aegis1863/CEA.
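One way to read the sampling method based on maximum kernel density estimation entropy is sketched below. It assumes a one-dimensional continuous action space, uniform candidate draws within the action bounds, a fixed Gaussian kernel bandwidth, and a resubstitution entropy estimate; the helper names `kde_entropy` and `sample_max_entropy_actions` and all parameter values are hypothetical, and the procedure in the paper may differ.

```python
import numpy as np

def kde_entropy(samples, bandwidth):
    """Resubstitution entropy estimate of 1-D samples under a Gaussian KDE."""
    x = np.asarray(samples, dtype=float)
    # Scaled pairwise differences, shape (k, k).
    z = (x[:, None] - x[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z ** 2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # KDE density evaluated at each sample point.
    density = kernel.mean(axis=1)
    return float(-np.mean(np.log(density)))

def sample_max_entropy_actions(low, high, k, n_candidates, bandwidth, rng):
    """Draw several candidate action sets and keep the most spread-out one."""
    best_set, best_h = None, -np.inf
    for _ in range(n_candidates):
        candidate = rng.uniform(low, high, size=k)
        h = kde_entropy(candidate, bandwidth)
        if h > best_h:
            best_set, best_h = candidate, h
    return best_set, best_h

rng = np.random.default_rng(0)
actions, entropy = sample_max_entropy_actions(
    low=-2.0, high=2.0, k=5, n_candidates=50, bandwidth=0.5, rng=rng)

# Sanity check of the estimator: spread-out points score higher than clustered ones.
h_spread = kde_entropy([-2.0, -1.0, 0.0, 1.0, 2.0], bandwidth=0.5)
h_clustered = kde_entropy([0.0, 0.01, 0.02, 0.03, 0.04], bandwidth=0.5)
```

Selecting the candidate set with the highest estimated entropy favors counterfactual actions that cover the action range evenly rather than clustering in one region, which is the intuition behind using such a criterion to carry a discrete-style augmentation over to continuous actions.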
