On Representation Complexity of Model-based and Model-free Reinforcement Learning

Hanlin Zhu, Baihe Huang, Stuart Russell

TL;DR

This work proves that there exists a broad class of MDPs whose underlying transition and reward functions can be represented by constant-depth circuits of polynomial size, while the optimal $Q$-function requires exponential size in constant-depth circuits.

Abstract

We study the representation complexity of model-based and model-free reinforcement learning (RL) through the lens of circuit complexity. We prove that there exists a broad class of MDPs whose underlying transition and reward functions can be represented by constant-depth circuits of polynomial size, while the optimal $Q$-function requires exponential size in constant-depth circuits. By drawing attention to approximation errors and building connections to complexity theory, our theory offers a representation-complexity explanation of why model-based algorithms usually enjoy better sample complexity than model-free algorithms: in some cases, the ground-truth rule (model) of the environment is simple to represent, while other quantities, such as the $Q$-function, appear complex. We empirically corroborate our theory by comparing the approximation errors of the transition kernel, reward function, and optimal $Q$-function in various MuJoCo environments; the approximation errors of the transition kernel and reward function are consistently lower than those of the optimal $Q$-function. To the best of our knowledge, this work is the first to study the circuit complexity of RL, and it provides a rigorous framework for future research.

Paper Structure

This paper contains 28 sections, 10 theorems, 40 equations, 9 figures, 2 tables.

Key Result

Proposition 2.6

$\mathsf{PARITY} \notin \mathbf{AC}^0$.
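
To give some intuition for this proposition, the sketch below looks only at the simplest constant-depth case, a depth-2 DNF. It is an illustration, not the proof from the cited work, and the function names (`parity`, `minterms`, `neighbors`, `dnf_terms_needed`) are hypothetical helpers that do not appear in the paper. Parity outputs 1 exactly when an odd number of input bits are 1. Flipping any single bit changes the output, so no DNF term can omit a variable, and each term covers exactly one accepting input. A DNF for $n$-bit parity therefore needs $2^{n-1}$ terms.

```python
from itertools import product


def parity(bits):
    """Return 1 if the number of ones in `bits` is odd, else 0."""
    return sum(bits) % 2


def minterms(n):
    """All n-bit inputs on which parity evaluates to 1."""
    return [x for x in product((0, 1), repeat=n) if parity(x) == 1]


def neighbors(x):
    """All inputs at Hamming distance 1 from x."""
    return [x[:i] + (1 - x[i],) + x[i + 1:] for i in range(len(x))]


def dnf_terms_needed(n):
    """Number of terms in the canonical DNF of n-bit parity.

    Every accepting input has only rejecting neighbours, so no term
    can drop a literal without also accepting a rejecting input.
    The canonical DNF (one term per accepting input) is thus minimal.
    """
    terms = minterms(n)
    # Sanity check of the claim above: all neighbours are rejecting.
    assert all(parity(y) == 0 for x in terms for y in neighbors(x))
    return len(terms)


for n in range(1, 9):
    # The term count doubles with every extra input bit: 2**(n - 1).
    print(n, dnf_terms_needed(n))
```

The proposition itself is much stronger than this depth-2 observation: it rules out polynomial-size circuits of any constant depth, which brute-force enumeration like the above cannot show.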

Figures (9)

  • Figure 1: Constant-depth circuit of the model transition function in parity MDP. An empty node is directly assigned the value of the node pointing to it.
  • Figure 2: Illustration of state for majority MDPs.
  • Figure 3: Approximation errors of the optimal $Q$-functions, reward functions, and transition functions in MuJoCo environments. In each environment, we run $5$ independent experiments and report the mean and standard deviation of the approximation errors. All curves (as well as those in Figures \ref{fig:sac-approx-errors-2-128}-\ref{fig:sac-approx-errors-mc}) are displayed with exponential-average smoothing at rate $0.2$.
  • Figure 4: Approximation errors of the optimal $Q$-functions, reward functions, and transition functions in MuJoCo environments, using $2$-(hidden) layer neural networks with width $128$. In each environment, we run $5$ independent experiments and report the mean and standard deviation of the approximation errors.
  • Figure 5: Approximation errors of the optimal $Q$-functions, reward functions, and transition functions in MuJoCo environments, using $1$-(hidden) layer neural networks with width $16$. In each environment, we run $5$ independent experiments and report the mean and standard deviation of the approximation errors.
  • ...and 4 more figures

Theorems & Definitions (34)

  • Definition 2.1: Boolean circuits, adapted from Definition 6.1, arora2009computational
  • Definition 2.2: Circuit computation
  • Definition 2.3: $(k,m)$-DNF
  • Definition 2.4: $\mathbf{AC}^0$
  • Definition 2.5: Parity
  • Proposition 2.6: furst1984parity
  • Definition 3.1: Parity MDP
  • Definition 3.2: Control function
  • Definition 3.3: Majority MDP
  • Remark 3.4: Control function and control bits
  • ...and 24 more