Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity

Guhao Feng; Han Zhong

TL;DR

The paper investigates a representation-complexity hierarchy across RL paradigms by constructing MDP families tied to classical complexity classes. It shows that the model (transition kernel and reward function) can be computed by constant-depth, polynomial-size circuits ($\mathsf{AC}^0$) or small MLPs, while representing the optimal policy and optimal value function can be $\mathsf{NP}$-complete or $\mathsf{P}$-complete. This yields a hierarchy of representation difficulty: model <= policy <= value. By introducing the 3-SAT MDP, $\mathsf{NP}$ MDP, CVP MDP, and $\mathsf{P}$ MDP, the work connects circuit and MLP expressiveness to RL targets and extends the results to log-precision MLPs and Transformer architectures, with empirical support from MuJoCo experiments. The findings bear on the design of sample-efficient RL systems and on the choice of which target to approximate, and may guide architecture choices in deep RL. Overall, the paper offers a rigorous, representation-complexity-based perspective on RL that complements existing statistical and optimization analyses.

Abstract

Reinforcement Learning (RL) encompasses diverse paradigms, including model-based RL, policy-based RL, and value-based RL, each tailored to approximate the model, optimal policy, and optimal value function, respectively. This work investigates the potential hierarchy of representation complexity -- the complexity of functions to be represented -- among these RL paradigms. We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be $\mathsf{NP}$-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is $\mathsf{P}$-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL -- representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.

Paper Structure (41 sections, 22 theorems, 79 equations, 3 figures, 4 tables)

This paper contains 41 sections, 22 theorems, 79 equations, 3 figures, 4 tables.

Introduction
Our Contributions
Related Works
Representation Complexity in RL.
Classic Computational Complexity Results.
Model-based RL, Policy-based RL, and Value-based RL.
Notations
Preliminaries
Markov Decision Process
Function Approximation in Model-based, Policy-based, and Value-based RL
Computational Complexity
The Separation between Model-based RL and Model-free RL
3-SAT MDP
$\mathsf{NP}$ MDP: A Broader Class of MDPs
The Separation between Policy-based RL and Value-based RL
...and 26 more sections

Key Result

Theorem 3.3

Let $\mathcal{M}_n$ be the $n$-dimensional 3-SAT MDP in Definition 3.2. The transition kernel $\mathcal{P}$ and the reward function $r$ of $\mathcal{M}_n$ can be computed by circuits with polynomial size (in $n$) and constant depth, falling within the circuit complexity class $\mathsf{AC}^0$.
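To make the gap behind this result concrete, the sketch below contrasts two tasks on a 3-CNF formula. It is only an illustration and does not reproduce the paper's Definition 3.2: the clause encoding (signed, 1-indexed integers) and the function names `eval_cnf` and `is_satisfiable` are assumptions of this example. Checking a given assignment is an AND over clauses of an OR over literals, which is a constant-depth, polynomial-size computation. Deciding whether any satisfying assignment exists is the $\mathsf{NP}$-complete 3-SAT problem, handled here by exhaustive search.

```python
from itertools import product

# Assumed encoding: a clause is a tuple of three nonzero ints,
# +k for variable k and -k for its negation (variables are 1-indexed).


def eval_cnf(clauses, assignment):
    """Check one assignment: an AND over clauses of an OR over literals."""
    return all(
        any(assignment[abs(lit) - 1] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def is_satisfiable(clauses, n):
    """Brute-force search over all 2**n assignments (exponential in n)."""
    return any(
        eval_cnf(clauses, bits) for bits in product([False, True], repeat=n)
    )


formula = [(1, 2, -3), (-1, 2, 3), (-2, -3, 1)]
print(eval_cnf(formula, (True, True, False)))  # True
print(is_satisfiable(formula, 3))              # True
```

Checking an assignment touches each literal once, whereas the search loop enumerates every assignment. This mirrors, informally, why a quantity that only depends on evaluating a formula is easy to represent, while a quantity that depends on whether a satisfying assignment exists is not.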

Figures (3)

Figure 1: A visualization of 3-SAT MDPs. Here, $\mathbf{v}$ is an $n$-dimensional vector, and $\mathbf{v}_0$ and $\mathbf{v}_1$ are the vectors obtained by replacing the $k$-th element of $\mathbf{v}$ with $0$ and $1$, respectively. Additionally, $\mathbf{v}_{\mathrm{end}}$, $\mathbf{v}'_{\mathrm{end}}$, and $\mathbf{v}''_{\mathrm{end}}$ represent the assignments at the end of the episode.
Figure 2: A visualization of CVP MDPs. Here, $\mathbf{v}_{\mathrm{unknown}}$, which contains $n$ Unknown values, is the initial value vector. For any state $s$ consisting of a circuit $\mathbf{c}$ and a value vector $\mathbf{v}$, choosing the action $i$ moves the environment to $(\mathbf{c}, \mathbf{v}'_i)$. Moreover, $\mathbf{v}_{\mathrm{end}}$, $\mathbf{v}'_{\mathrm{end}}$, and $\mathbf{v}''_{\mathrm{end}}$ are value vectors at the end of the episode.
Figure 3: The approximation errors obtained when MLPs with varying depths $d$ and widths $w$ are used to approximate the transition kernel, reward function, optimal policy, and optimal Q-function in four MuJoCo environments. In each subfigure, the title indicates the configuration, including the hidden dimension, number of layers, and dataset size. The x-axis lists the four MuJoCo environments, where H.C. represents HalfCheetah and I.P. represents InvertedPendulum. The y-axis represents the approximation error defined by the paper's approximation objective.
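The transition described in the Figure 2 caption can be sketched as follows. This is a hypothetical reading of the caption, not the paper's definition: the gate encoding `(op, a, b)`, the `UNKNOWN` placeholder, and the rule that action `i` fills in entry `i` only when its inputs are already known are all assumptions made for illustration.

```python
UNKNOWN = None  # placeholder for a value that has not been computed yet


def step(circuit, values, i):
    """Return the value vector after choosing action i (assumed rule)."""
    op, a, b = circuit[i]
    new_values = list(values)
    if op in ("TRUE", "FALSE"):
        # Constant gates need no inputs.
        new_values[i] = op == "TRUE"
    elif op == "NOT":
        if values[a] is not UNKNOWN:
            new_values[i] = not values[a]
    elif op in ("AND", "OR"):
        if values[a] is not UNKNOWN and values[b] is not UNKNOWN:
            if op == "AND":
                new_values[i] = values[a] and values[b]
            else:
                new_values[i] = values[a] or values[b]
    # If an input is still UNKNOWN, entry i is left unchanged.
    return new_values


# Hypothetical 4-gate circuit: NOT(TRUE OR FALSE).
circuit = [
    ("TRUE", None, None),
    ("FALSE", None, None),
    ("OR", 0, 1),
    ("NOT", 2, None),
]
values = [UNKNOWN] * len(circuit)
for action in range(len(circuit)):
    values = step(circuit, values, action)
print(values)  # [True, False, True, False]
```

Under these assumptions, each transition reads at most two entries and writes one, so a single step is a small local computation. The value of the final gate, by contrast, depends on evaluating the whole circuit, which is the $\mathsf{P}$-complete circuit value problem.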

Theorems & Definitions (38)

Definition 2.1: Boolean Circuits
Definition 3.1: 3-SAT Problem
Definition 3.2: 3-SAT MDP
Theorem 3.3: Representation complexity of 3-SAT MDP
Remark 3.4
Remark 3.5
Remark 3.6: Extension to the Stochastic Setting
Remark 3.7: Connection to POMDP
Definition 3.8: $\mathsf{NP}$ MDP
Theorem 3.9: Representation complexity of $\mathsf{NP}$ MDP
...and 28 more
