On the continuity and smoothness of the value function in reinforcement learning and optimal control

Hans Harder; Sebastian Peitz

TL;DR

It is shown that the value function is always Hölder continuous under relatively weak assumptions on the underlying system and that non-differentiable value functions can be made differentiable by slightly “disturbing” the system.

Abstract

The value function plays a crucial role as a measure for the cumulative future reward an agent receives in both reinforcement learning and optimal control. It is therefore of interest to study how similar the values of neighboring states are, i.e., to investigate the continuity of the value function. We do so by providing and verifying upper bounds on the value function's modulus of continuity. Additionally, we show that the value function is always Hölder continuous under relatively weak assumptions on the underlying system and that non-differentiable value functions can be made differentiable by slightly "disturbing" the system.

Paper Structure (9 sections, 10 theorems, 57 equations, 3 figures)

This paper contains 9 sections, 10 theorems, 57 equations, 3 figures.

Introduction
Definitions
Continuity of the value function
Hölder and Lipschitz continuity
Sharpness and experiments
An example for a steep value function
Sharpness
Disturbance implies Differentiability
Conclusion

Key Result

Proposition 1

Let $\Phi(x) = 4x(1-x)$ be the logistic map on $S=[0,1]$, put $r(x) = x$ and let $\gamma \in [\frac{1}{2}, 1)$. Then $v(x)=\sum_{n=0}^\infty \gamma^n r(\Phi^n(x))$ is nowhere differentiable.
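The series in Proposition 1 can be explored numerically. The following is a minimal sketch, not the authors' code: the function names `logistic` and `value` and the truncation length `n_terms` are assumptions made for illustration. Because $0 \le r \le 1$ on $S$, truncating after $N$ terms changes the sum by at most $\gamma^N/(1-\gamma)$. Floating-point orbits of a chaotic map drift away from the exact ones, so the computed values should be read as approximations.

```python
import numpy as np


def logistic(x):
    """Logistic map Phi(x) = 4x(1-x) on [0, 1]."""
    return 4.0 * x * (1.0 - x)


def value(x, gamma=0.8, n_terms=200):
    """Truncated series sum_{n < n_terms} gamma^n * r(Phi^n(x)) with r(x) = x.

    n_terms is an illustrative choice; the neglected tail is bounded by
    gamma**n_terms / (1 - gamma).
    """
    state = np.asarray(x, dtype=float)
    total = np.zeros_like(state)
    discount = 1.0
    for _ in range(n_terms):
        total = total + discount * state  # add gamma^n * r(Phi^n(x))
        state = logistic(state)           # advance the orbit by one step
        discount *= gamma
    return total


# Difference quotients at one point for shrinking step sizes h.
x0 = 0.3
for h in (1e-2, 1e-3, 1e-4, 1e-5):
    quotient = (value(x0 + h) - value(x0)) / h
    print(f"h = {h:.0e}, difference quotient = {float(quotient):.3f}")
```

Two values can be checked by hand: $x = 3/4$ is a fixed point of $\Phi$, so $v(3/4) = \frac{3}{4}\cdot\frac{1}{1-\gamma}$, and $\Phi(1/2) = 1$, $\Phi(1) = 0$ give $v(1/2) = \frac{1}{2} + \gamma$. For a nowhere differentiable $v$, the printed difference quotients are not expected to settle to a limit as $h$ shrinks.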

Figures (3)

Figure 1: The value function $v$ from Proposition 1 for the discount factor $\gamma = 0.8$. The "smoothed" version $w$ is the value function that one obtains when disturbing the same system using Gaussian noise with standard deviation $\sigma = 0.01$, cf. \ref{thr:differentiability}.
Figure 2: Visual depiction of the idea in \ref{thr:hoelder-integrals}.
Figure 3: Left: The value functions corresponding to the example in \ref{sec:sharpness_and_example} for $L = 1.5$ and discount factors $(\gamma_0, \gamma_1, \gamma_2) = (0.5, 0.9, 0.99)$, normalized to a maximal value of $1$. Right: The moduli of continuity of the value functions in comparison to the bounds given by \ref{thr:hoelder-sums}, visualized by dashed lines in the same color. The bounds from \cite{bernsteinAdaptiveresolutionReinforcementLearning2010} are visualized using dotted lines, also in the same color.

Theorems & Definitions (24)

Proposition 1: see \cite{yamagutiWeierstrassFunctionChaos1983}
Remark 1
Proposition 2
proof
Definition 1
Example 1
Remark 2
Remark 3
Theorem 1
proof
...and 14 more
