AdaCred: Adaptive Causal Decision Transformers with Feature Crediting

Hemant Kumawat, Saibal Mukhopadhyay

TL;DR

AdaCred tackles offline reinforcement learning and imitation learning under long-sequence and suboptimal-data challenges by modeling trajectories as causal latent graphs and applying a feature crediting and pruning mechanism. It introduces Adaptive Causal Decision Transformers, combining a Spatial Transformer with a Temporal Causal Transformer to identify a minimal sufficient latent set $\mathbf{g}_t^{\min}$ for policy learning, with theoretical guarantees on identifiability under the Markov and faithfulness assumptions. The approach is supported by a two-stage training procedure and a sparsity regularizer, enabling efficient learning from short sequences while preserving performance. Empirically, AdaCred delivers superior or competitive results on Atari and Gym benchmarks in both offline RL and imitation learning, achieving peak performance with substantially shorter trajectories and improved computational efficiency.

Abstract

Reinforcement learning (RL) can be formulated as a sequence modeling problem, where models predict future actions based on historical state-action-reward sequences. Current approaches typically require long trajectory sequences to model the environment in offline RL settings. However, these models tend to over-rely on memorizing long-term representations, which impairs their ability to effectively attribute importance to trajectories and learned representations based on task-specific relevance. In this work, we introduce AdaCred, a novel approach that represents trajectories as causal graphs built from short-term action-reward-state sequences. Our model adaptively learns a control policy by crediting and pruning low-importance representations, retaining only those most relevant for the downstream task. Our experiments demonstrate that AdaCred-based policies require shorter trajectory sequences and consistently outperform conventional methods in both offline reinforcement learning and imitation learning environments.
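The crediting-and-pruning idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `credit_and_prune`, the per-representation `scores`, and the choice to zero out pruned rows are all assumptions made for illustration. The default `keep_ratio=0.75` only echoes the 75% crediting setting named in the figure captions.

```python
def credit_and_prune(features, scores, keep_ratio=0.75):
    """Keep the highest-scoring representations and zero out the rest.

    features:   list of n feature vectors (hypothetical layout).
    scores:     list of n importance scores (hypothetical credit signal).
    keep_ratio: fraction of representations to retain.
    Returns (pruned_features, keep_mask).
    """
    n = len(scores)
    k = max(1, round(keep_ratio * n))  # always keep at least one row
    # Rank indices by descending score; sorted() is stable, so ties keep
    # their original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:k])
    mask = [i in keep for i in range(n)]
    # Assumption: pruned representations are replaced by zero vectors.
    pruned = [list(row) if m else [0.0] * len(row)
              for row, m in zip(features, mask)]
    return pruned, mask


features = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
scores = [0.1, 0.9, 0.5, 0.7]
pruned, mask = credit_and_prune(features, scores)
```

With four representations and a 75% keep ratio, the lowest-scoring one (index 0) is zeroed and the other three pass through unchanged. In a learned model the scores would come from the network itself rather than being supplied by hand.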

Paper Structure

This paper contains 24 sections, 2 theorems, 8 equations, 7 figures, 4 tables, 1 algorithm.

Key Result

Theorem 1

Under the assumption that the causal graph $G$ is Markov and faithful to the observed data, the set of minimal latent states $\mathbf{g}_t^{\min} \subseteq \mathbf{g}_t$ is defined as: $g_{i,t} \in \mathbf{g}_t^{\min}$ if $c_{i}^{g \to r} = 1$ or $g_{i,t}$ has a directed path to a future reward thro
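The theorem statement is cut off in this excerpt, so the following is only a sketch of the part that is visible: a latent belongs to $\mathbf{g}_t^{\min}$ if it has a direct edge to the reward ($c_{i}^{g \to r} = 1$) or a directed path to a future reward. It assumes, hypothetically, that such paths run through time-invariant latent-to-latent edges encoded in a binary mask `latent_adj`. That mask and the function name `minimal_latent_set` are illustrative and do not come from the source.

```python
from collections import deque


def minimal_latent_set(c_g_to_r, latent_adj):
    """Return indices of latents that can influence a current or future reward.

    c_g_to_r:   list of 0/1 flags; c_g_to_r[i] == 1 means latent i has a
                direct edge to the reward.
    latent_adj: n x n list of 0/1 flags (assumed, time-invariant);
                latent_adj[i][j] == 1 means latent i at time t has an edge
                to latent j at time t + 1.
    """
    n = len(c_g_to_r)
    # Start from the direct reward parents and walk edges backwards, so
    # every latent visited has a directed path to some reward.
    in_min = [bool(c) for c in c_g_to_r]
    queue = deque(i for i in range(n) if in_min[i])
    while queue:
        j = queue.popleft()
        for i in range(n):
            if latent_adj[i][j] and not in_min[i]:
                in_min[i] = True
                queue.append(i)
    return [i for i in range(n) if in_min[i]]


# Latent 2 feeds the reward directly; 0 -> 1 -> 2 reach it over time;
# latent 3 only influences itself and is pruned.
c_g_to_r = [0, 0, 1, 0]
latent_adj = [
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 1],
]
minimal = minimal_latent_set(c_g_to_r, latent_adj)
```

Here the masks are given by hand, whereas the paper's setting would require estimating them from data. The backward breadth-first search is simply one convenient way to compute reachability once the masks are available.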

Figures (7)

  • Figure 1: POMDP view of RL
  • Figure 2: Our model design
  • Figure 3: (a) Spatial Transformer with Crediting. Output of spatial transformer is sent to the next spatial transformer. (b) Temporal Causal Transformer with Crediting
  • Figure 4: Performance for 75% Spatial and Temporal Crediting
  • Figure 5: Performance for 75% Spatial and Temporal Crediting for Offline RL
  • ...and 2 more figures

Theorems & Definitions (3)

  • Definition 1: Graphical Representation of Latent State Transitions
  • Theorem 1: Minimal Sufficient Representations for Policy Learning
  • Theorem 2: Structural Identifiability