The Value of Reward Lookahead in Reinforcement Learning

Nadav Merlis, Dorian Baudry, Vianney Perchet

TL;DR

This work measures the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead, characterizes the worst-case reward distribution, and derives exact ratios for the worst-case reward expectations.

Abstract

In reinforcement learning (RL), agents sequentially interact with changing environments while aiming to maximize the obtained rewards. Usually, rewards are observed only after acting, and so the goal is to maximize the expected cumulative reward. Yet, in many practical settings, reward information is observed in advance: prices are observed before performing transactions; nearby traffic information is partially known; and goals are often given to agents prior to the interaction. In this work, we aim to quantitatively analyze the value of such future reward information through the lens of competitive analysis. In particular, we measure the ratio between the value of standard RL agents and that of agents with partial future-reward lookahead. We characterize the worst-case reward distribution and derive exact ratios for the worst-case reward expectations. Surprisingly, the resulting ratios relate to known quantities in offline RL and reward-free exploration. We further provide tight bounds for the ratio given the worst-case dynamics. Our results cover the full spectrum from observing the immediate rewards before acting to observing all the rewards before the interaction starts.
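
To make the ratio in the abstract concrete, here is a minimal sketch for a single-step problem. The reward model is an assumption chosen for illustration, not taken from the paper: each of `num_actions` actions independently pays `1/eps` with probability `eps` and zero otherwise. A standard agent sees only expectations, while a lookahead agent sees the realized rewards before choosing. The function name `one_step_ratio` is hypothetical.

```python
def one_step_ratio(num_actions: int, eps: float) -> float:
    """Value ratio (no lookahead / one-step lookahead) for the assumed reward model."""
    # Without lookahead every action has the same expected reward,
    # eps * (1 / eps) = 1, so the standard agent earns 1 in expectation.
    no_lookahead_value = eps * (1.0 / eps)

    # With lookahead the agent collects 1/eps whenever at least one action pays.
    prob_some_action_pays = 1.0 - (1.0 - eps) ** num_actions
    lookahead_value = prob_some_action_pays / eps

    return no_lookahead_value / lookahead_value


# As eps shrinks, the ratio approaches 1 / num_actions.
ratio_rare = one_step_ratio(num_actions=4, eps=1e-6)
# With eps = 1 the rewards are deterministic and lookahead gives no advantage.
ratio_deterministic = one_step_ratio(num_actions=4, eps=1.0)
```

Under this assumed model, rare but large rewards are where lookahead helps most, since seeing the realizations lets the agent pick whichever action happens to pay.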

Paper Structure (15 sections, 10 theorems, 92 equations, 3 figures)

Key Result

Theorem 1

CR versus Full Lookahead Agents (see the appendix with the full-lookahead proofs for the proof). Worst-case distributions: $CR^H(P,r) = \max_{\pi\in\Pi^{\mathcal{M}}}\frac{\sum_{(h,s,a)\in\mathcal{X}}d_h^{\pi}(s,a)\,r_h(s,a)}{\sum_{(h,s,a)\in\mathcal{X}}d_h^*(s)\,r_h(s,a)}.$ Worst-case rewa...
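
The worst-case-distribution formula above can be evaluated numerically on a small tabular MDP. The following is a rough sketch under stated assumptions, not the authors' implementation. The excerpt does not define $d_h^*(s)$, so the code assumes it is the largest probability with which any Markov policy can occupy state $s$ at step $h$. It also assumes a deterministic start state, zero-based steps, and nested-list inputs `P[h][s][a][s_next]` and `r[h][s][a]`. All function names are hypothetical.

```python
def optimal_value(P, r, s0=0):
    """Numerator: max over policies of sum d_h^pi(s,a) r_h(s,a), via backward induction."""
    num_states = len(r[0])
    V = [0.0] * num_states
    for h in reversed(range(len(r))):
        V = [
            max(
                r[h][s][a] + sum(p * v for p, v in zip(P[h][s][a], V))
                for a in range(len(r[h][s]))
            )
            for s in range(num_states)
        ]
    return V[s0]


def max_reach(P, h, target, num_states, s0=0):
    """Assumed meaning of d_h^*(target): max over policies of Pr(s_h = target)."""
    V = [1.0 if s == target else 0.0 for s in range(num_states)]
    for t in reversed(range(h)):
        V = [
            max(sum(p * v for p, v in zip(P[t][s][a], V)) for a in range(len(P[t][s])))
            for s in range(num_states)
        ]
    return V[s0]


def full_lookahead_cr(P, r, s0=0):
    """Evaluate the displayed ratio; assumes the denominator is positive."""
    num_states = len(r[0])
    denominator = sum(
        max_reach(P, h, s, num_states, s0) * r[h][s][a]
        for h in range(len(r))
        for s in range(num_states)
        for a in range(len(r[h][s]))
    )
    return optimal_value(P, r, s0) / denominator


# Example 1: one step, one state, three actions with unit expected reward.
P_bandit = [[[[1.0], [1.0], [1.0]]]]
r_bandit = [[[1.0, 1.0, 1.0]]]
cr_bandit = full_lookahead_cr(P_bandit, r_bandit)

# Example 2: two steps, two states. In state 0, action 0 stays and action 1
# moves to the absorbing state 1. Only action 0 at the second step is rewarded.
step = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [0.0, 1.0]]]
P_chain = [step, step]
r_chain = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
cr_chain = full_lookahead_cr(P_chain, r_chain)
```

Because the denominator does not depend on $\pi$, the maximization reduces to an ordinary finite-horizon planning problem, which is why plain backward induction suffices for the numerator.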

Figures (3)

  • Figure 1: Examples: CR for grid and chain environments.
  • Figure 2: A near-worst-case environment: tree-like MDP. An agent can decide to stay at the root of the tree, but once it starts to traverse the tree, it must navigate to one of its leaves, from which it moves to a non-rewarding terminal state. All leaves have long-shot rewards, while all other nodes yield no reward.
  • Figure 3: Illustration of a possible flow on a grid graph, starting from the bottom-left corner and ending at the top-right corner. The first step is to distribute the flow over the bottom and leftmost states, such that there is an excess flow of $\frac{1}{2(n-1)}$ in each of these states (green). At the leftmost states, this excess flow is sent in a straight line towards the right (blue), while in the bottom row, this flow is sent up (red). Such a flow ensures that all edges have a minimal flow of $\frac{1}{2(n-1)}$.

Theorems & Definitions (22)

  • Definition 1
  • Remark 1
  • Remark 2
  • Definition 2
  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Theorem 3: CR versus Full Lookahead Agents
  • proof
  • Proposition 3
  • ...and 12 more