Reward-Aware Proto-Representations in Reinforcement Learning

Hon Tik Tse, Siddarth Chandrasekar, Marlos C. Machado

TL;DR

This paper studies the default representation (DR), a recently proposed representation with limited theoretical and empirical analysis, and shows that, compared to the successor representation (SR), the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.

Abstract

In recent years, the successor representation (SR) has attracted increasing attention in reinforcement learning (RL), and it has been used to address some of its key challenges, such as exploration, credit assignment, and generalization. The SR can be seen as representing the underlying credit assignment structure of the environment by implicitly encoding its induced transition dynamics. However, the SR is reward-agnostic. In this paper, we discuss a similar representation that also takes into account the reward dynamics of the problem. We study the default representation (DR), a recently proposed representation with limited theoretical (and empirical) analysis. Here, we lay some of the theoretical foundation underlying the DR in the tabular case by (1) deriving dynamic programming and (2) temporal-difference methods to learn the DR, (3) characterizing the basis for the vector space of the DR, and (4) formally extending the DR to the function approximation case through default features. Empirically, we analyze the benefits of the DR in many of the settings in which the SR has been applied, including (1) reward shaping, (2) option discovery, (3) exploration, and (4) transfer learning. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings.

Paper Structure

This paper contains 37 sections, 6 theorems, 48 equations, 11 figures, 3 tables, 1 algorithm.

Key Result

Theorem 3.1

Suppose both the SR and DR are computed with respect to the same policy, i.e., $\pi = \pi_d$. When the reward function is constant, i.e., $r(s) = r(s') \ \forall s, s' \in \mathscr{S}$, the $i$-th eigenvectors of the SR and DR are equivalent, and the $i$-th eigenvalues of the SR ($\lambda_{\text{SR}}$ […] where $\gamma$ is the discount factor of the SR, $r(s)$ is the state reward, and $\lambda$ is the r[…]
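
The eigenvector claim in Theorem 3.1 can be checked numerically on a toy problem. The sketch below assumes the tabular definitions $\text{SR} = (I - \gamma P)^{-1}$ and $\text{DR} = (\text{diag}(e^{-r/\lambda}) - P)^{-1}$, which this summary does not spell out. The ring environment, the helper names, and the parameter values are hypothetical. The eigenvalue relation tested at the end is derived from these assumed definitions and may be written differently in the paper.

```python
import numpy as np


def ring_random_walk(n):
    """Transition matrix of a uniform random walk on a ring of n states (toy example)."""
    P = np.zeros((n, n))
    for s in range(n):
        P[s, (s - 1) % n] = 0.5
        P[s, (s + 1) % n] = 0.5
    return P


def successor_representation(P, gamma):
    # Assumed definition: SR = (I - gamma * P)^{-1}
    return np.linalg.inv(np.eye(len(P)) - gamma * P)


def default_representation(P, r, lam):
    # Assumed definition: DR = (diag(exp(-r / lam)) - P)^{-1}
    return np.linalg.inv(np.diag(np.exp(-r / lam)) - P)


n, gamma, lam = 6, 0.9, 1.0
P = ring_random_walk(n)
r = np.full(n, -1.0)  # constant reward, as the theorem requires

SR = successor_representation(P, gamma)
DR = default_representation(P, r, lam)

# P is symmetric here, so SR is symmetric and eigh returns orthonormal eigenvectors.
sr_vals, vecs = np.linalg.eigh(SR)

# If v is also an eigenvector of DR, then DR @ v equals (v . DR v) * v.
dr_vals = np.array([v @ DR @ v for v in vecs.T])
residual = max(np.linalg.norm(DR @ v - d * v) for v, d in zip(vecs.T, dr_vals))

# Under the assumed definitions, an eigenvalue mu of P gives
# sr = 1 / (1 - gamma * mu) and dr = 1 / (exp(-r / lam) - mu).
mu = (1.0 - 1.0 / sr_vals) / gamma
predicted_dr_vals = 1.0 / (np.exp(-r[0] / lam) - mu)

print("largest eigenvector residual:", residual)
print("eigenvalue relation holds:", np.allclose(dr_vals, predicted_dr_vals))
```

Replacing `r` with a non-constant vector makes the residual grow, which is consistent with the theorem only covering the constant-reward case.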

Figures (11)

  • Figure 1: Episodic envs. adapted to incorporate negative rewards. Clockwise: 1) grid task [dayan1993improving], 2) four rooms [sutton1999between], 3) grid room [wang2021towards], and 4) grid maze [wang2021towards]. Start state is in blue. The agent receives $-1$ reward at every time step unless it steps on red tiles ($-20$ reward) or reaches the goal in green ($0$ reward).
  • Figure 2: Top eigenvectors of the SR and the DR in the environments shown in Figure 1. We report the logarithm of the DR for better visualization due to very different magnitudes.
  • Figure 3: The avg. undiscounted return over 50 runs for potential-based reward shaping using the DR (DR-pot), the SR (SR-pot), the prior approach using the SR (SR-prior) [wang2021towards], and no shaping (ns) in the environments shown in Figure 1. The shaded area indicates 95% confidence interval.
  • Figure 4: Top row: Average reward vs. state visitation percentage for various hyperparameter settings of iterative online eigenoption discovery via CEO, RACE, and random walk (RW) in the environments from Figure 1. For reference, solid dots mark settings with the highest visitation. Bottom row: Undiscounted return of RACE+Q, CEO+Q, and QL (baseline), averaged over 50 seeds. Rightmost environments are shown in Figure \ref{fig:env_larger}. Shaded areas indicate 95% confidence intervals.
  • Figure 5: Left: Four rooms with multiple goals. Right: Cumulative return across new terminal reward configurations. Curves are averaged over 50 runs. The shaded area shows 95% conf. interval.
  • ...and 6 more figures

Theorems & Definitions (13)

  • Theorem 3.1
  • Theorem 4.1
  • Definition 5.1
  • proof
  • proof
  • Theorem 3.1
  • proof
  • Theorem 4.1
  • proof
  • Proposition C.2
  • ...and 3 more