Counterfactual experience augmented off-policy reinforcement learning

Sunbowen Lee, Yicheng Gong, Chao Deng

TL;DR

This paper tackles the out-of-distribution (OOD) problem and inefficient exploration in off-policy RL by introducing Counterfactual Experience Augmentation (CEA), which combines a State Transition Autoencoder (STA) with counterfactual action sampling to synthetically augment experience while preserving reward signals through bisimulation-based matching. STA uses a conditional variational autoencoder (CVAE) to model stochastic state transitions via a latent variable, enabling counterfactual inference conditioned on untried actions; counterfactual actions are generated using maximum-entropy Gaussian kernel density estimation and propagated through the STA to form complete counterfactual experiences. These virtual experiences are paired with real samples via a closest-transition approach to assign rewards, and are added to the replay pool with prioritized sampling to improve data efficiency. The method is demonstrated on SUMO and Highway (discrete actions) and Pendulum and Lunar Lander (continuous actions), showing strong or competitive performance and offering a pathway to alleviating the OOD problem in diverse RL tasks. Limitations include computational cost and the need for refined bisimulation assumptions, with future work proposed on adaptive augmentation and broader continuous-action applicability.

Abstract

Reinforcement learning control algorithms face significant challenges due to out-of-distribution and inefficient exploration problems. While model-based reinforcement learning enhances the agent's reasoning and planning capabilities by constructing virtual environments, training such virtual environments can be very complex. In order to build an efficient inference model and enhance the representativeness of learning data, we propose the Counterfactual Experience Augmentation (CEA) algorithm. CEA leverages variational autoencoders to model the dynamic patterns of state transitions and introduces randomness to model non-stationarity. This approach focuses on expanding the learning data in the experience pool through counterfactual inference and performs exceptionally well in environments that follow the bisimulation assumption. Because environments with bisimulation properties are usually represented by discrete observation and action spaces, we propose a sampling method based on maximum kernel density estimation entropy to extend CEA to various environments. By providing reward signals for counterfactual state transitions based on real information, CEA constructs a complete counterfactual experience to alleviate the out-of-distribution problem of the learning data, and outperforms general SOTA algorithms in environments with different properties. Finally, we discuss the similarities, differences and properties of generated counterfactual experiences and real experiences. The code is available at https://github.com/Aegis1863/CEA.
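
The idea of giving counterfactual state transitions a reward signal taken from real information can be illustrated with a small sketch. This is not the authors' implementation: the function name `assign_rewards`, the choice to compare transitions as concatenated (state, next state) vectors, and the Euclidean distance are all assumptions made for illustration. Each counterfactual transition simply inherits the reward of the nearest real transition.

```python
import numpy as np

def assign_rewards(real_s, real_s_next, real_r, cf_s, cf_s_next):
    """Hypothetical helper: copy to each counterfactual transition the reward
    of the closest real transition (Euclidean distance on [s, s'])."""
    # Assumption: a transition is represented by its state and next state.
    real = np.concatenate([real_s, real_s_next], axis=1)
    cf = np.concatenate([cf_s, cf_s_next], axis=1)
    # Pairwise squared distances, shape (n_counterfactual, n_real).
    d2 = ((cf[:, None, :] - real[None, :, :]) ** 2).sum(axis=-1)
    nearest = d2.argmin(axis=1)
    # Return the borrowed rewards and the distance to the matched sample.
    dist = np.sqrt(d2[np.arange(len(cf)), nearest])
    return real_r[nearest], dist

# Two real transitions with known rewards, one counterfactual transition.
real_s = np.array([[0.0, 0.0], [1.0, 1.0]])
real_s_next = np.array([[0.0, 1.0], [2.0, 2.0]])
real_r = np.array([0.0, 1.0])
cf_s = np.array([[1.0, 1.0]])
cf_s_next = np.array([[2.0, 2.1]])

cf_r, cf_dist = assign_rewards(real_s, real_s_next, real_r, cf_s, cf_s_next)
```

The returned distance could be used to discard counterfactual samples whose closest real match is too far away, although whether CEA applies such a threshold is not stated here.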

Paper Structure

This paper contains 28 sections, 31 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Counterfactual inference learning. The agent's task is to find the treasure chest. When the agent moves to a state with the same coordinates as the treasure chest, the reward increases by 1; otherwise it remains 0. The entire process is marked with serial numbers in the figure. In the first and second steps, the agent randomly selects its actions and records the resulting position transitions. This record can be used for inference learning; the resulting ability is summarized in the table "Inference ability". Once this inference ability is obtained, it can be used to infer the results of the corresponding actions in new situations, as in step four. The agent can compare the predicted results with historical information, as shown in step six, and thereby choose the best action without actually executing it. The action space and state space of the environment are both discrete, which satisfies the bisimulation assumption.
  • Figure 2: Causal directed acyclic graph of state transition process. In this case, different state-action pairs will lead to the same result, satisfying the bisimulation property.
  • Figure 3: Sampling based on kernel density estimation
  • Figure 4: MDP with counterfactual actions
  • Figure 5: CEA framework
  • ...and 5 more figures
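
The treasure-chest example in the Figure 1 caption can be sketched in a few lines. The code below is a tabular stand-in, not the paper's learned model: the function names, the action labels, and the assumption that each action causes a fixed displacement on the grid are all hypothetical. It records position transitions, derives a per-action displacement table from them, and then picks an action in a new position by comparing predicted outcomes with states that were rewarded in the past.

```python
from collections import Counter, defaultdict

def learn_displacements(transitions):
    """Build a hypothetical 'inference ability' table: action -> (dx, dy).
    transitions is a list of ((x, y), action, (x_next, y_next))."""
    votes = defaultdict(Counter)
    for (x, y), action, (x2, y2) in transitions:
        votes[action][(x2 - x, y2 - y)] += 1
    # Keep the most frequently observed displacement for each action.
    return {a: c.most_common(1)[0][0] for a, c in votes.items()}

def predict(state, action, table):
    """Infer the next position without executing the action."""
    dx, dy = table[action]
    return (state[0] + dx, state[1] + dy)

def choose_action(state, table, rewarded_states):
    """Return an action whose predicted outcome was rewarded before, else None."""
    for action in sorted(table):
        if predict(state, action, table) in rewarded_states:
            return action
    return None

# Steps one and two: random actions and the recorded position transitions.
history = [((0, 0), "right", (1, 0)), ((1, 0), "up", (1, 1))]
table = learn_displacements(history)

# Historical information: the treasure chest was found at (2, 3).
rewarded = {(2, 3)}
best = choose_action((2, 2), table, rewarded)
```

Because the table is learned from displacements rather than absolute positions, it applies to positions the agent has never visited, which is the point the caption makes about inferring outcomes in new situations.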