Task-Aware Exploration via a Predictive Bisimulation Metric

Dayang Liang; Ruihan Liu; Lipeng Wan; Yunlong Liu; Bo An

TL;DR

TEB is a Task-aware Exploration approach that tightly couples task-relevant representations with exploration through a predictive Bisimulation metric. It uses the metric both to learn behaviorally grounded task representations and to measure behaviorally intrinsic novelty over the learned latent space.

Abstract

Accelerating exploration in visual reinforcement learning under sparse rewards remains challenging because observations contain substantial task-irrelevant variation. Despite advances in intrinsic exploration, many methods either assume access to low-dimensional states or lack task-aware exploration strategies, which makes them fragile in visual domains. To bridge this gap, we present TEB, a Task-aware Exploration approach that tightly couples task-relevant representations with exploration through a predictive Bisimulation metric. Specifically, TEB uses the metric both to learn behaviorally grounded task representations and to measure behaviorally intrinsic novelty over the learned latent space. To realize this, we first show theoretically that introducing a simple predicted reward differential into the metric mitigates the representation collapse that degenerate bisimulation metrics suffer under sparse rewards. Building on this robust metric, we design potential-based exploration bonuses that measure the relative novelty of adjacent observations in the latent space. Extensive experiments on MetaWorld and Maze2D show that TEB achieves superior exploration ability and outperforms recent baselines.
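The potential-based bonus mentioned in the abstract can be illustrated with a minimal sketch. It assumes the classic shaping form $\gamma\Phi(s')-\Phi(s)$, with the potential $\Phi$ taken as the distance from a latent state to an anchor state. The names `potential`, `shaping_bonus`, and `discounted_bonus_sum` are hypothetical, and a plain L1 distance stands in for the learned predictive bisimulation metric; the exact bonus used by TEB is not given in this summary.

```python
import numpy as np


def potential(z, z_anchor):
    """Hypothetical potential: L1 distance from latent z to an anchor latent.

    Assumption: the L1 distance is a stand-in for the learned metric.
    """
    z = np.asarray(z, dtype=float)
    z_anchor = np.asarray(z_anchor, dtype=float)
    return float(np.abs(z - z_anchor).sum())


def shaping_bonus(z, z_next, z_anchor, gamma=0.99):
    """Potential-based bonus for one transition: gamma * Phi(s') - Phi(s)."""
    return gamma * potential(z_next, z_anchor) - potential(z, z_anchor)


def discounted_bonus_sum(latents, z_anchor, gamma=0.99):
    """Discounted sum of bonuses along a trajectory of latent states."""
    return sum(
        gamma**t * shaping_bonus(latents[t], latents[t + 1], z_anchor, gamma)
        for t in range(len(latents) - 1)
    )


# A toy trajectory of 2-D latents that moves away from the anchor.
anchor = [0.0, 0.0]
trajectory = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0], [3.0, 2.0]]
total = discounted_bonus_sum(trajectory, anchor, gamma=0.99)
```

Because the bonuses telescope, `total` equals $\gamma^{T}\Phi(s_T)-\Phi(s_0)$ for a trajectory of $T$ transitions. The shaped return therefore depends only on the endpoints, which is the usual intuition for why potential-based shaping leaves the optimal policy unchanged.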

Paper Structure (34 sections, 10 theorems, 48 equations, 8 figures, 3 tables, 1 algorithm)

Introduction
Preliminaries
Bisimulation Metrics
Method
The Risk of Bisimulation Metrics
The Proposed Predictive Bisimulation Metric
Metric-based Intrinsic Exploration
Experiment
Experimental Setup
Environments.
MetaWorld Experiments
Reward-free Maze2D Experiments
Ablation Studies
Effectiveness of Predicted Gaussian Rewards
Related Work
...and 19 more sections

Key Result

Theorem 2.1

Given an MDP $\mathcal{M}$ and a fixed policy $\pi$, the following on-policy bisimulation metric exists and is unique:

$$d^\pi(s_i,s_j)=c_R\,\bigl|r^\pi_{s_i}-r^\pi_{s_j}\bigr|+c_T\,\mathcal{W}_1(d^\pi)\bigl(\mathcal{P}^\pi_{s_i},\mathcal{P}^\pi_{s_j}\bigr),$$

where $r^\pi_{s}=\mathbb{E}_{a\sim\pi}[r(s,a)]$, $\mathcal{P}_{s_i}^\pi=\mathbb{E}_{a\sim\pi}[\mathcal{P}(\cdot\mid s_i,a)]$, and $\mathcal{W}_1$ is the 1-Wasserstein distance. $c_R>0$ and $c_T\in[0,1)$ are the coefficients of the reward and transition terms of the metric, respectively.
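To make the fixed-point statement concrete, here is a minimal sketch for a tabular MDP whose transitions under $\pi$ are deterministic. It assumes the standard recursion: the distance between two states is $c_R$ times their reward gap plus $c_T$ times the 1-Wasserstein distance between their next-state distributions. For deterministic transitions that Wasserstein term reduces to the distance between the two successor states. The function name `pi_bisim_metric` and the toy chain are hypothetical and do not come from the paper.

```python
import numpy as np


def pi_bisim_metric(rewards, next_state, c_R=1.0, c_T=0.9,
                    tol=1e-10, max_iters=10_000):
    """Fixed-point iteration for d(i, j) = c_R*|r_i - r_j| + c_T*d(i', j').

    Assumption: transitions under the policy are deterministic, so the
    1-Wasserstein term is just the distance between successor states.
    """
    rewards = np.asarray(rewards, dtype=float)
    next_state = np.asarray(next_state, dtype=int)
    n = len(rewards)
    reward_gap = c_R * np.abs(rewards[:, None] - rewards[None, :])
    d = np.zeros((n, n))
    for _ in range(max_iters):
        # d[np.ix_(ns, ns)][i, j] == d[ns[i], ns[j]]
        d_new = reward_gap + c_T * d[np.ix_(next_state, next_state)]
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d


# Toy chain 0 -> 1 -> 2 -> 2 with a single reward at the absorbing state.
chain = [1, 2, 2]
d_sparse = pi_bisim_metric([0.0, 0.0, 1.0], chain)

# Same chain with no reward signal at all.
d_zero = pi_bisim_metric([0.0, 0.0, 0.0], chain)
```

Since $c_T<1$, the update is a contraction, which is why the iteration converges to a unique metric. With all-zero rewards the fixed point `d_zero` is identically zero, so every pair of states is indistinguishable. This is the degenerate, collapsed metric that the abstract attributes to sparse rewards.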

Figures (8)

Figure 1: An illustration of existing exploration in noise space and our task-aware exploration. In complex tasks, existing exploration methods remain limited by task-irrelevant elements, resulting in risky transitions like the "yellow line". In contrast, the task-relevant space and exploration bonus built on the bisimulation metric can help visual RL complete tasks faster. The right figure illustrates how to construct an exploration bonus that measures the task-relevant novelty of a state relative to a global anchor state, using the unified metric in a rigorous task space.
Figure 2: Success rates of TEB and baselines in the MetaWorld environment. Each experiment is run with three random seeds; the shaded region represents the standard deviation across seeds.
Figure 3: Visualization of state coverage in representative Maze2D tasks.
Figure 4: Ablation Studies. Left: the three figures show the ablation curves of TEB components across three tasks; Right: the fourth figure illustrates the ablation results regarding anchor state selection for intrinsic rewards across the three tasks. Each task is run for 1M steps with three random seeds. Note that for better performance comparison, the anchor ablation experiment (right) exhibits the episode success rate at 50% of the training steps.
Figure 5: Performance analysis of the bisimulation metric with different reward signals.
...and 3 more figures

Theorems & Definitions (14)

Theorem 2.1: $\pi$-bisimulation metric [castro2020scalable]
Lemma 3.1: Diameter of $\mathcal{S}$ is bounded [kemertas2021towards]
Lemma 3.2: Convergence and Fixed Point
Theorem 3.3: Predictive Reward Prevents Degenerate Metric
Theorem 3.4: Value Difference Bound
Theorem 3.5: Policy Invariance of Metric-based Shaping
Lemma 1.1: Convergence and Fixed Point
proof
Theorem 1.2: Predictive Reward Prevents Degenerate Metric
proof
...and 4 more
