TW-CRL: Time-Weighted Contrastive Reward Learning for Efficient Inverse Reinforcement Learning

Yuxuan Li, Yicheng Gao, Ning Yang, Stephen Xia

TL;DR

TW-CRL introduces a time-weighted, contrastive inverse reinforcement learning framework that learns dense rewards from both successful and failed demonstrations in episodic tasks with trap states. By modeling trajectory dynamics with an absorbing Markov chain and applying a time-weighted function, it assigns greater importance to later states that most influence success or failure. The Contrastive Reward Learning loss guides the reward network to differentiate goal- and trap-related states, enabling efficient exploration beyond imitation. Empirical results across navigation and robotic manipulation benchmarks show faster convergence and improved robustness compared to state-of-the-art IRL baselines, with demonstrated generalization to unseen goal configurations. This approach offers a principled way to leverage failures to avoid traps and accelerate learning in complex, sparse-reward environments.

Abstract

Episodic tasks in Reinforcement Learning (RL) often pose challenges due to sparse reward signals and high-dimensional state spaces, which hinder efficient learning. Additionally, these tasks often feature hidden "trap states" -- irreversible failures that prevent task completion but do not provide explicit negative rewards to guide agents away from repeated errors. To address these issues, we propose Time-Weighted Contrastive Reward Learning (TW-CRL), an Inverse Reinforcement Learning (IRL) framework that leverages both successful and failed demonstrations. By incorporating temporal information, TW-CRL learns a dense reward function that identifies critical states associated with success or failure. This approach not only enables agents to avoid trap states but also encourages meaningful exploration beyond simple imitation of expert trajectories. Empirical evaluations on navigation tasks and robotic manipulation benchmarks demonstrate that TW-CRL surpasses state-of-the-art methods, achieving improved efficiency and robustness.
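As a rough illustration of the idea above, the sketch below builds per-step reward targets that grow toward the end of a trajectory and are positive for successful demonstrations and negative for failed ones. Everything in it is an assumption made for illustration: the linear ramp in `time_weights`, the signed targets, and the squared-error objective are hypothetical stand-ins, not the time-weighted function or the Contrastive Reward Learning loss defined in the paper.

```python
def time_weights(length):
    """Hypothetical linear ramp: weights rise from 1/length to 1.0 at the last step."""
    return [(t + 1) / length for t in range(length)]


def contrastive_targets(length, success):
    """Signed per-step targets: positive for a successful trajectory, negative for a failed one."""
    sign = 1.0 if success else -1.0
    return [sign * w for w in time_weights(length)]


def reward_regression_loss(predicted_rewards, success):
    """Mean squared error between predicted per-step rewards and the signed targets.

    This is an assumed stand-in objective, not the loss from the paper.
    """
    targets = contrastive_targets(len(predicted_rewards), success)
    squared_errors = [(p - y) ** 2 for p, y in zip(predicted_rewards, targets)]
    return sum(squared_errors) / len(squared_errors)


# A failed trajectory of four steps: later states receive more negative targets.
failed_targets = contrastive_targets(4, success=False)
```

Under these assumptions, a reward model trained against such targets would assign its strongest positive values near goal states and its strongest negative values near trap states, which is the qualitative behavior the abstract describes.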

Paper Structure

This paper contains 32 sections, 22 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Overview of TW-CRL.
  • Figure 2: Training curves on benchmarks. The solid curves represent the mean, and the shaded regions indicate the standard deviation over five runs.
  • Figure 3: Visualization of the reward function in the TrapMaze-v1 environment for TW-CRL and baseline methods. Each column represents a different method, and each row shows a training stage, with the final row illustrating the fully trained reward functions.
  • Figure 4: Illustration of the map and potential trajectories in TrapMaze-v1.
  • Figure 5: Ablation studies in TrapMaze-v1 and TrapMaze-v2 environments. The left figures show the ablation of the Time-Weighted function; the right figures show the ablation of the Contrastive Reward Learning loss function.
  • ...and 5 more figures

Theorems & Definitions (1)

  • Definition 3.1: Trap state