On Value Functions and the Agent-Environment Boundary

Nan Jiang

TL;DR

The paper investigates how the agent-environment boundary affects the definability of value functions and the guarantees of RL algorithms with function approximation. It proposes a boundary-invariant analysis framework, demonstrated through a boundary-invariant rendition of Fitted Q-Iteration and supported by boundary-invariant treatment of contextual bandits. The work shows that, under boundary-invariant assumptions, near-optimal guarantees hold regardless of boundary choice and discusses implications for state resetting, MCTS, imitation learning, and verifiability. It encourages rethinking states and value functions in RL and highlights practical considerations when boundaries are ambiguous or unavailable.

Abstract

When function approximation is deployed in reinforcement learning (RL), the same problem may be formulated in different ways, often by treating a pre-processing step as a part of the environment or as part of the agent. As a consequence, fundamental concepts in RL, such as (optimal) value functions, are not uniquely defined as they depend on where we draw this agent-environment boundary, causing problems in theoretical analyses that provide optimality guarantees. We address this issue via a simple and novel boundary-invariant analysis of Fitted Q-Iteration, a representative RL algorithm, where the assumptions and the guarantees are invariant to the choice of boundary. We also discuss closely related issues on state resetting and Monte-Carlo Tree Search, deterministic vs stochastic systems, imitation learning, and the verifiability of theoretical assumptions from data.

Paper Structure

This paper contains 27 sections, 9 theorems, 36 equations, 2 figures.

Key Result

Theorem 1

Under the exploration and realizability assumptions (asm:cb_explore and asm:cb_realizable), $v^{\pi_{\hat{f}}} \ge v^\star - 2\sqrt{C\epsilon}$.

Figures (2)

  • Figure 1: Illustration of the agent-environment boundaries. Strictly speaking, the environments defined by the intermediate boundaries are partially observable, but we can view them as MDPs over histories (of actions and observations defined by the boundary). Partial observability therefore has little to do with our concerns, so we stick to MDP terminology in the main text and, for simplicity and clarity, do not invoke POMDP concepts.
  • Figure 2: Illustration of two different formulations of the same problem when function approximation is deployed. (a) A contextual bandit with two contexts, $s_A$ and $s_B$, which appear with equal probabilities, i.e., $d_0(s_A) = d_0(s_B)=0.5$. The only action yields $+1$ reward and $+0$ reward in $s_A$ and $s_B$, respectively. The function approximator contains only 1 function $Q(s_A) = Q(s_B) = 0.5$ (action omitted since there is only 1 action). (b) A contextual bandit with one context $s$. The only action available yields a Bernoulli distributed stochastic reward. The function approximator contains only 1 function $Q'(s)=0.5$.
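
The contrast in Figure 2 can be checked numerically. The sketch below is illustrative only and is not taken from the paper: the variable names are hypothetical, and it assumes the Bernoulli reward in formulation (b) has mean 0.5, which the caption does not state explicitly but which follows if (b) is obtained from (a) by hiding the context. It compares the single candidate function against the expected reward in each formulation.

```python
# Formulation (a): two contexts with deterministic rewards, drawn uniformly.
d0 = {"s_A": 0.5, "s_B": 0.5}          # initial context distribution d_0
reward_a = {"s_A": 1.0, "s_B": 0.0}    # reward of the only action in each context
q_a = {"s_A": 0.5, "s_B": 0.5}         # the single function in the class

# Formulation (b): one context with a Bernoulli reward.
# ASSUMPTION: the Bernoulli mean is 0.5 (not stated in the caption).
bernoulli_mean = 0.5
q_b = 0.5                              # the single function Q'(s)

# Largest per-context gap between the candidate and the true expected reward in (a).
pointwise_error_a = max(abs(q_a[s] - reward_a[s]) for s in d0)

# Gap between the candidate and the true expected reward in (b).
error_b = abs(q_b - bernoulli_mean)

# Averaged over d_0, the prediction and the reward in (a) coincide.
avg_reward_a = sum(d0[s] * reward_a[s] for s in d0)
avg_prediction_a = sum(d0[s] * q_a[s] for s in d0)

print("pointwise error in (a):", pointwise_error_a)
print("error in (b):", error_b)
print("average reward vs. prediction in (a):", avg_reward_a, avg_prediction_a)
```

Under these assumptions, the same constant prediction is off by 0.5 in every context of (a) yet exact in (b), even though both describe the same data-generating process. This is the kind of boundary dependence the caption illustrates.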

Theorems & Definitions (18)

  • Theorem 1
  • Theorem 2: Robust version of Theorem \ref{thm:cb_lt}
  • Claim 3
  • Definition 1: Admissible distributions (bandit)
  • Theorem 4
  • Proposition 5
  • Definition 2: Admissible distributions (MDP)
  • Theorem 6
  • Lemma 7: Boundary-invariant version of $\gamma$-contraction
  • Theorem 8
  • ...and 8 more