Variable-Agnostic Causal Exploration for Reinforcement Learning

Minh Hoang Nguyen; Hung Le; Svetha Venkatesh

TL;DR

Variable-Agnostic Causal Exploration for Reinforcement Learning (VACERL) is a framework that uses causal relationships to drive exploration in RL without requiring the environment's causal variables to be specified. It significantly improves agent performance in grid-world, 2D game, and robotic domains.

Abstract

Modern reinforcement learning (RL) struggles to capture real-world cause-and-effect dynamics, leading to inefficient exploration due to extensive trial-and-error actions. While recent efforts to improve agent exploration have leveraged causal discovery, they often make unrealistic assumptions about the causal variables in the environment. In this paper, we introduce a novel framework, Variable-Agnostic Causal Exploration for Reinforcement Learning (VACERL), which incorporates causal relationships to drive exploration in RL without specifying environmental causal variables. Our approach automatically identifies crucial observation-action steps associated with key variables using attention mechanisms. It then constructs a causal graph connecting these steps, which guides the agent towards observation-action pairs with greater causal influence on task completion. This graph can be leveraged to generate intrinsic rewards or to establish a hierarchy of subgoals that enhances exploration efficiency. Experimental results show a significant improvement in agent performance in grid-world, 2D game, and robotic domains, particularly in scenarios with sparse rewards and noisy actions, such as the notorious Noisy-TV environments.
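The first step described in the abstract, picking out crucial observation-action steps from attention scores and rewarding the agent for reaching them, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `top_m_crucial_steps` and `intrinsic_reward`, the flat bonus value, and the idea that one attention score is available per step of a successful trajectory. It is not the authors' implementation, which additionally weights steps through a learned causal graph.

```python
import numpy as np


def top_m_crucial_steps(attention, steps, m):
    """Return up to m distinct (observation, action) steps, highest attention first.

    attention: one score per step of a successful trajectory (hypothetical input,
               e.g. the attention a sequence model pays to each step when
               predicting the final step).
    steps:     list of hashable (observation_id, action) pairs, same length.
    """
    attention = np.asarray(attention, dtype=float)
    if attention.shape[0] != len(steps):
        raise ValueError("attention and steps must have the same length")
    # Stable sort on the negated scores gives descending order with
    # earlier steps winning ties.
    order = np.argsort(-attention, kind="stable")
    chosen = []
    for idx in order:
        if steps[idx] not in chosen:  # skip repeated observation-action pairs
            chosen.append(steps[idx])
        if len(chosen) == m:
            break
    return chosen


def intrinsic_reward(step, crucial_steps, bonus=0.1):
    """Flat exploration bonus (an assumed value) for visiting a crucial step."""
    return bonus if step in crucial_steps else 0.0


# Toy trajectory: five steps with made-up attention scores.
toy_steps = [("s0", "left"), ("s1", "pickup"), ("s2", "right"),
             ("s3", "toggle"), ("s4", "forward")]
toy_attention = [0.05, 0.40, 0.10, 0.30, 0.15]
crucial = top_m_crucial_steps(toy_attention, toy_steps, m=2)
```

A flat bonus is the simplest possible choice; a reward scaled by each step's estimated causal influence would be closer in spirit to what the abstract describes.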

Paper Structure (28 sections, 8 equations, 12 figures, 6 tables, 3 algorithms)

Introduction
Related Work
Methods
Background
Variable-Agnostic Causal Exploration Reinforcement Learning Framework
Overview.
Phase 1: Crucial Step Detection.
Phase 2: Causal Structure Discovery.
Phase 3: Agent Training with Causal Information.
Experiments
VACERL: Causal Intrinsic Rewards - Implementation and Evaluation
VACERL: Causal Subgoals - Implementation and Evaluation
Ablation Study and Model Analysis
Conclusion
Details of Methodology
...and 13 more sections

Figures (12)

Figure 1: (a): A causally structured environment (MG-2): the agent, starting in the left room, is given a +1 reward when it picks up the blue box located in the right room. (b): A possible causal graph representing the ideal causal steps for the environment in (a): pick up the ball and drop it in another position; pick up the key to open the door; and ultimately pick up the target goal. (c): VACERL framework. While the agent trains in the environment (initially with a random policy), we extract any successful trajectories and fill buffer $B$. The Transformer ($TF$) model, trained using $B$, takes input $\{o_{t}^{k},a_{t}^{k}\}_{t=1}^{T^{k}-1}$ to predict $(\hat{o}_{T}^{k},\hat{a}_{T}^{k})$. $TF$'s attention score ($a_{s}$) is used to determine the set $S_{COAS}$ and buffer $B^{*}$. Parameters $\delta$ and $\eta$ of the $SCM$ are optimized using $B^{*}$. We extract a causal hierarchy tree from the causal graph and use it to design the approaches employed for agent training.
Figure 2: The learning curves for MG-2 illustrate the average return (mean$\pm$std over 3 runs) of 50 testing episodes over 2 million training steps. For VACERL and causal baselines, the learning curves include 300,000 steps dedicated to the initial random exploration period to collect initial successful trajectories, with rewards for each step set to 0. The causal graph is reconstructed every 300,000 steps.
Figure 3: (a, b): Learning curves of FetchReach (a) (% represents the portion of random subgoals replaced) and FetchPickAndPlace (b). The average return of 50 testing episodes (mean$\pm$std over 3 runs). For VACERL, successful trajectories are collected during training and causal graphs are reconstructed every 2,000 episodes and 10,000 episodes, respectively. (c): Component contribution study on MG-2 task. The average return of 50 episodes (mean$\pm$std. over 3 runs).
Figure 4: (a,b): Attention heatmap when $B$ has 4 (a) and 40 (b) trajectories for MG-2 task. We highlight the top-8 attended actions and their grids. A big cell (black boundary) represents a grid in the map, containing smaller cells representing 6 possible actions. (c): Results tuning the number of steps ($M$) in $S_{COAS}$. Average return of 50 episodes (mean$\pm$std. over 3 runs).
Figure 5: Causal graph generated by the algorithm for MG-2. The RGB images display the agent's viewports during the execution of the corresponding actions.
...and 7 more figures
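The Figure 1 caption mentions extracting a causal hierarchy tree from the causal graph. One simple way to read a hierarchy off a directed graph is to walk cause-to-effect edges backwards from the goal and record each node's distance. The sketch below does that under stated assumptions: the function name `hierarchy_levels` is hypothetical, the edge list is an illustrative chain loosely following Figure 1(b), and breadth-first distance is a stand-in technique, not necessarily how the paper builds its tree.

```python
from collections import deque


def hierarchy_levels(edges, goal):
    """Map each node that can reach the goal to its distance (in edges) from it.

    edges: iterable of (cause, effect) pairs describing a directed graph.
    goal:  the node representing task completion (level 0).
    """
    # Index the direct causes of every effect so we can walk backwards.
    causes_of = {}
    for cause, effect in edges:
        causes_of.setdefault(effect, []).append(cause)

    level = {goal: 0}
    queue = deque([goal])
    while queue:  # breadth-first search from the goal
        node = queue.popleft()
        for cause in causes_of.get(node, []):
            if cause not in level:  # first visit gives the shortest distance
                level[cause] = level[node] + 1
                queue.append(cause)
    return level


# Illustrative chain only; the node names are assumptions, not the paper's graph.
toy_edges = [
    ("pick_up_ball", "drop_ball"),
    ("drop_ball", "pick_up_key"),
    ("pick_up_key", "open_door"),
    ("open_door", "pick_up_goal"),
]
levels = hierarchy_levels(toy_edges, goal="pick_up_goal")
```

Nodes at higher levels sit further from the goal, so a training scheme could, for example, present them as earlier subgoals; nodes with no path to the goal are simply left out of the result.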
