Quasimetric Value Functions with Dense Rewards

Khadichabonu Valieva, Bikramjit Banerjee

TL;DR

We show that the triangle inequality, the key property underpinning a quasimetric, is preserved in a dense reward setting, and our empirical results confirm that training a quasimetric value function with dense rewards outperforms training with sparse rewards.

Abstract

As a generalization of reinforcement learning (RL) to parametrizable goals, goal-conditioned RL (GCRL) has a broad range of applications, particularly in challenging tasks in robotics. Recent work has established that the optimal value function of GCRL, $Q^\ast(s,a,g)$, has a quasimetric structure, leading to targeted neural architectures that respect such structure. However, the relevant analyses assume a sparse reward setting, a known aggravating factor for sample complexity. We show that the key property underpinning a quasimetric, viz., the triangle inequality, is preserved under a dense reward setting as well. Contrary to earlier findings where dense rewards were shown to be detrimental to GCRL, we identify the key condition necessary for the triangle inequality. Dense reward functions that satisfy this condition can only improve, never worsen, sample complexity. This opens up opportunities to train efficient neural architectures with dense rewards, compounding their benefits to sample complexity. We evaluate this proposal in 12 standard GCRL benchmark environments featuring challenging continuous control tasks. Our empirical results confirm that training a quasimetric value function in our dense reward setting indeed outperforms training with sparse rewards.

Paper Structure (14 sections, 2 theorems, 27 equations, 2 figures)

Key Result

Proposition 1

Consider the shaped, goal-conditioned MDP $M_{GCF}=({\cal S}, {\cal A}, {\cal G}, T, R+F, \gamma, \rho_0, \rho_g)$, with ${\cal G}\equiv {\cal S}\times{\cal A}$. The optimal universal value function $Q^\ast_F$ satisfies the triangle inequality: $\forall x^1, x^2, x^3\in {\cal X}$, The only condition $\phi$ must satisfy is w.r.t. the unshaped value function, for which a sufficient condition is es

Figures (2)

  • Figure 1: GCRL benchmark environments (Plappert et al., 2018). Figure from Liu et al. (2023).
  • Figure 2: Comparison of MRN with sparse rewards vs. dense rewards. Learning curves are averaged over five independent trials, with one-standard-deviation bands. Dense rewards yield a statistically significant improvement in performance in 4 of the 12 environments, viz., FetchSlide, BlockFull, EggFull and PenFull. There is no statistically significant deterioration in any environment.

Theorems & Definitions (3)

  • Proposition 1
  • Definition 1
  • Proposition 2