Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity
Guhao Feng, Han Zhong
TL;DR
The paper investigates a representation-complexity hierarchy across RL paradigms by constructing MDP families tied to classical complexity classes. It shows that the model can be represented by constant-depth, polynomial-size circuits ($\mathsf{AC}^0$) or constant-layer MLPs, whereas representing the optimal policy and optimal value can be $\mathsf{NP}$-complete, and representing the optimal value can be $\mathsf{P}$-complete even when the model and optimal policy are simple. This suggests the hierarchy model $\le$ policy $\le$ value. By introducing the 3-SAT MDP, $\mathsf{NP}$ MDP, CVP MDP, and $\mathsf{P}$ MDP, the work connects circuit and MLP expressiveness to RL targets and extends the results to log-precision MLPs and Transformer architectures, with empirical support from MuJoCo experiments. The findings bear on which target (model, policy, or value) to approximate and may guide architecture choices in deep RL. Overall, the paper offers a rigorous representation-complexity perspective on RL that complements existing statistical and optimization analyses.
Abstract
Reinforcement Learning (RL) encompasses diverse paradigms, including model-based RL, policy-based RL, and value-based RL, each tailored to approximate the model, optimal policy, and optimal value function, respectively. This work investigates the potential hierarchy of representation complexity -- the complexity of functions to be represented -- among these RL paradigms. We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be $\mathsf{NP}$-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is $\mathsf{P}$-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL -- representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.
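To make the gap between model and value concrete, the following is a minimal toy sketch in Python. It is not the paper's 3-SAT MDP construction; it only assumes a simplified setting in which each action sets one Boolean variable and a reward of 1 is given at the end if a fixed CNF formula is satisfied. All names (`step`, `reward`, `optimal_value`) are hypothetical and chosen for illustration.

```python
from itertools import product
from typing import List, Tuple

# A clause is a tuple of nonzero ints: k means "variable k is true",
# -k means "variable k is false" (variables are numbered from 1).
Clause = Tuple[int, ...]


def step(assignment: Tuple[int, ...], action: int) -> Tuple[int, ...]:
    """Model (transition): append one bit. A cheap, purely local operation."""
    return assignment + (action,)


def reward(formula: List[Clause], assignment: Tuple[int, ...], n_vars: int) -> int:
    """Model (reward): 1 only at the end, and only if every clause is satisfied."""
    if len(assignment) < n_vars:
        return 0
    return int(all(
        any((assignment[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
        for clause in formula
    ))


def optimal_value(formula: List[Clause], n_vars: int) -> int:
    """Optimal value at the initial state, by brute force over all 2^n action
    sequences. It equals 1 exactly when the formula is satisfiable."""
    best = 0
    for actions in product((0, 1), repeat=n_vars):
        state: Tuple[int, ...] = ()
        for a in actions:
            state = step(state, a)
        best = max(best, reward(formula, state, n_vars))
    return best


satisfiable = [(1, 2, 3), (-1, -2, 3)]
unsatisfiable = [(1,), (-1,)]
print(optimal_value(satisfiable, 3), optimal_value(unsatisfiable, 1))
```

In this toy, `step` and `reward` are simple to evaluate, while `optimal_value` at the initial state decides satisfiability of the formula, which is the intuition behind the model-versus-value gap described in the abstract. The brute-force search is used only for illustration and says nothing about the circuit or MLP bounds proved in the paper.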
