On Representation Complexity of Model-based and Model-free Reinforcement Learning

Hanlin Zhu; Baihe Huang; Stuart Russell

TL;DR

This work proves that there exists a broad class of MDPs whose underlying transition and reward functions can be represented by constant-depth circuits of polynomial size, while the optimal $Q$-function requires exponential size when represented by constant-depth circuits.

Abstract

We study the representation complexity of model-based and model-free reinforcement learning (RL) in the context of circuit complexity. We prove that there exists a broad class of MDPs whose underlying transition and reward functions can be represented by constant-depth circuits of polynomial size, while the optimal $Q$-function requires exponential size when represented by constant-depth circuits. By drawing attention to approximation errors and building connections to complexity theory, our theory offers a representation-complexity explanation of why model-based algorithms usually enjoy better sample complexity than model-free algorithms: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as the $Q$-function, are complex. We empirically corroborate our theory by comparing the approximation errors of the transition kernel, reward function, and optimal $Q$-function in various MuJoCo environments; the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal $Q$-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, and it provides a rigorous framework for future research.

Paper Structure (28 sections, 10 theorems, 40 equations, 9 figures, 2 tables)

This paper contains 28 sections, 10 theorems, 40 equations, 9 figures, 2 tables.

Introduction
Related work
Model-based vs. model-free algorithms.
Ground-truth dynamics.
Approximation error.
Circuit complexity.
Notations
Preliminaries
Markov Decision Process
Function approximation
Circuit complexity
Theoretical Results
Warm up example
A broader family of MDPs
Experiments
...and 13 more sections

Key Result

Proposition 2.6

$\mathsf{PARITY} \notin \mathbf{AC}^0$.
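To give a feel for this result, the following is a minimal sketch, not taken from the paper, of the simplest special case: a depth-2 circuit (a DNF, i.e. an OR of ANDs). The helper names `parity`, `parity_dnf_terms`, and `eval_dnf` are hypothetical and introduced only for illustration. The canonical DNF for parity on $n$ bits has one term per odd-weight input, hence $2^{n-1}$ terms, and the code checks that this formula agrees with parity on every input. The proposition is much stronger than this sketch, since it rules out polynomial-size circuits of any constant depth, not only depth 2.

```python
from itertools import product


def parity(bits):
    """Return 1 if the number of ones in `bits` is odd, else 0."""
    result = 0
    for b in bits:
        result ^= b
    return result


def parity_dnf_terms(n):
    """List the terms of the canonical DNF for parity on n bits.

    Each term is a full assignment of odd Hamming weight; as an AND of
    n literals it is satisfied only by that exact input.
    """
    return [x for x in product((0, 1), repeat=n) if sum(x) % 2 == 1]


def eval_dnf(terms, bits):
    """Evaluate the depth-2 circuit: an OR over terms, each an AND of n literals."""
    return int(any(all(b == t for b, t in zip(bits, term)) for term in terms))


# Illustrative check (assumed small n): the DNF matches parity on all inputs,
# and its number of terms doubles with every additional input bit.
for n in range(1, 6):
    terms = parity_dnf_terms(n)
    assert len(terms) == 2 ** (n - 1)
    assert all(eval_dnf(terms, x) == parity(x) for x in product((0, 1), repeat=n))
```

The sequential XOR loop in `parity` uses a number of steps linear in $n$, whereas the flat depth-2 representation grows exponentially; this gap between deep and shallow representations is the kind of phenomenon the proposition makes rigorous for constant-depth circuits.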

Figures (9)

Figure 1: Constant-depth circuit of the model transition function in parity MDP. An empty node is directly assigned the value of the node pointing to it.
Figure 2: Illustration of state for majority MDPs.
Figure 3: Approximation errors of the optimal $Q$-functions, reward functions, and transition functions in MuJoCo environments. In each environment, we run $5$ independent experiments and report the mean and standard deviation of the approximation errors. All curves (as well as those in Figures \ref{fig:sac-approx-errors-2-128}-\ref{fig:sac-approx-errors-mc}) are displayed with an exponential average smoothing with rate $0.2$.
Figure 4: Approximation errors of the optimal $Q$-functions, reward functions, and transition functions in MuJoCo environments, using $2$-(hidden) layer neural networks with width $128$. In each environment, we run $5$ independent experiments and report the mean and standard deviation of the approximation errors.
Figure 5: Approximation errors of the optimal $Q$-functions, reward functions, and transition functions in MuJoCo environments, using $1$-(hidden) layer neural networks with width $16$. In each environment, we run $5$ independent experiments and report the mean and standard deviation of the approximation errors.
...and 4 more figures

Theorems & Definitions (34)

Definition 2.1: Boolean circuits, adapted from Definition 6.1, arora2009computational
Definition 2.2: Circuit computation
Definition 2.3: $(k,m)$-DNF
Definition 2.4: $\mathbf{AC}^0$
Definition 2.5: Parity
Proposition 2.6: furst1984parity
Definition 3.1: Parity MDP
Definition 3.2: Control function
Definition 3.3: Majority MDP
Remark 3.4: Control function and control bits
...and 24 more
