An Optimal Tightness Bound for the Simulation Lemma

Sam Lobel; Ronald Parr

TL;DR

This work presents a bound for value-prediction error with respect to model misspecification that is tight, including constant factors, and derives a bound that is sub-linear with respect to transition function misspecification.

Abstract

We present a bound for value-prediction error with respect to model misspecification that is tight, including constant factors. This is a direct improvement of the "simulation lemma," a foundational result in reinforcement learning. We demonstrate that existing bounds are quite loose, becoming vacuous for large discount factors, due to the suboptimal treatment of compounding probability errors. By carefully considering this quantity on its own, instead of as a subcomponent of value error, we derive a bound that is sub-linear with respect to transition function misspecification. We then demonstrate broader applicability of this technique, improving a similar bound in the related subfield of hierarchical abstraction.
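To make the "compounding probability errors" idea concrete, the following sketch compares two ways of bounding how far apart the $t$-step state distributions of two Markov chains can drift when each row of their transition matrices differs by at most $\epsilon$ in $L_1$. It is an illustration under stated assumptions, not the paper's derivation: the linear bound $t\epsilon$ and the compounding bound $2\,(1 - (1 - \epsilon/2)^t)$ are the standard triangle-inequality and coupling bounds, and the names `perturbed_chain` and `compare_bounds`, along with the random chains themselves, are hypothetical.

```python
import numpy as np


def perturbed_chain(P, eps, rng):
    """Return P_hat whose rows each differ from the rows of P by at most eps in L1."""
    noise = rng.dirichlet(np.ones(P.shape[1]), size=P.shape[0])
    # Mixing in eps/2 of another distribution moves each row by
    # (eps/2) * ||noise - P||_1 <= eps, since two distributions differ by at most 2.
    return (1 - eps / 2) * P + (eps / 2) * noise


def compare_bounds(n_states=5, eps=0.1, horizon=30, seed=0):
    """Track the true t-step L1 distance alongside the two upper bounds."""
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_states), size=n_states)  # hypothetical "true" chain
    P_hat = perturbed_chain(P, eps, rng)                  # hypothetical model
    mu = np.zeros(n_states)
    mu[0] = 1.0                                           # both chains start in state 0
    mu_hat = mu.copy()
    rows = []
    for t in range(1, horizon + 1):
        mu, mu_hat = mu @ P, mu_hat @ P_hat
        actual = float(np.abs(mu - mu_hat).sum())
        linear = t * eps                                  # grows without limit
        compounding = 2 * (1 - (1 - eps / 2) ** t)        # saturates at 2
        rows.append((t, actual, compounding, linear))
    return rows


rows = compare_bounds()
for t, actual, compounding, linear in rows[::5]:
    print(f"t={t:2d}  actual={actual:.4f}  compounding={compounding:.4f}  linear={linear:.4f}")
```

The linear bound exceeds 2, the largest possible $L_1$ distance between two distributions, once $t > 2/\epsilon$, whereas the compounding bound never does. This mirrors the abstract's observation that existing bounds become vacuous over long effective horizons.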

Paper Structure (17 sections, 1 theorem, 33 equations, 2 figures)

This paper contains 17 sections, 1 theorem, 33 equations, 2 figures.

Introduction
Background and Related Work
Exploration
Abstraction
Offline Policy Evaluation
Main Result
Original Simulation Lemma
Bounding Probability Distance
A Tight Bound on Value Error
Proof of Tightness
Value Loss of Optimal Policy
Application to Hierarchy
Conclusion
Full proof of Simulation Lemma
Application to the Finite-Horizon Setting
...and 2 more sections

Key Result

theorem 1

For two MDPs $\mathcal{M}$ and $\hat{\mathcal{M}}$ related as described in Equations \ref{eq:original-sim-lemma-t-condition} and \ref{eq:original-sim-lemma-r-condition}, the following inequality holds: Furthermore, this bound is tight.

Figures (2)

Figure 1: Visualization of the relation between the $L_1$ distance and the overlap of two probability distributions (Equation \ref{eq:tvd-l1-equivalence}). The blue and orange shaded regions together comprise the $L_1$ distance. The brown region represents the overlap $\bar{p}$. The overlap plus either the blue or the orange region constitutes a probability distribution, and therefore has total area $1$. Thus the blue and orange regions each have area ${\lVert p - \hat{p}\rVert_1 / 2}$, and so ${\lVert \bar{p}\rVert_1 = 1 - \lVert p - \hat{p}\rVert_1 /2}$.
Figure 2: Bounds on value error given by the original simulation lemma and by our tighter bounds, normalized by $V_{MAX}$. (Left) The bound on value error as $\gamma$ increases shows the original lemma's suboptimality with respect to the discount factor. (Right) The bound on value error as misspecification increases shows the looseness of the linear approximation compared to the tight bound.
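
The identity in the Figure 1 caption can be checked numerically. The sketch below assumes discrete distributions and takes the overlap to be the elementwise minimum of the two distributions, which is how the caption's brown region reads. The helper names `l1_distance` and `overlap` and the random test distributions are hypothetical.

```python
import numpy as np


def l1_distance(p, p_hat):
    """L1 distance between two discrete distributions."""
    return float(np.abs(p - p_hat).sum())


def overlap(p, p_hat):
    """Total mass shared by both distributions (elementwise minimum)."""
    return float(np.minimum(p, p_hat).sum())


rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(6))      # hypothetical distribution p
p_hat = rng.dirichlet(np.ones(6))  # hypothetical distribution p_hat

# Caption identity: overlap = 1 - ||p - p_hat||_1 / 2
gap = overlap(p, p_hat) - (1.0 - l1_distance(p, p_hat) / 2.0)
print(f"overlap={overlap(p, p_hat):.6f}  gap={gap:.2e}")
```

The identity follows from $\min(a, b) = (a + b - |a - b|)/2$ summed over outcomes, so `gap` should be zero up to floating-point rounding.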
