On Computation and Reinforcement Learning

Raj Ghugare, Michał Bortkiewicz, Alicja Ziarko, Benjamin Eysenbach

TL;DR

The paper argues that compute duration, not just parameter count, determines RL performance. It formalizes compute-bounded policies and proves a Policy Hierarchy Theorem and a Long Horizon Generalization result, showing that more compute can solve broader MDPs and generalize to longer horizons, while less compute may overfit. To study this, it introduces a minimal recurrent architecture, the IRU, that uses a fixed parameter budget but scales compute through iterative application, enabling strong performance and horizon generalization across 31 tasks. Empirical results demonstrate that increasing recurrent steps yields substantial gains and offer a method to measure the value of compute (VoC) during RL. The work suggests computation should be treated as a distinct resource in RL design, with potential adaptive compute strategies and extensions to transformer-based architectures as fruitful future directions.

Abstract

How does the amount of compute available to a reinforcement learning (RL) policy affect its learning? Can policies using a fixed number of parameters still benefit from additional compute? The standard RL framework does not provide a language to answer these questions formally. Empirically, deep RL policies are often parameterized as neural networks with static architectures, conflating the amount of compute and the number of parameters. In this paper, we formalize compute-bounded policies and prove that policies which use more compute can solve problems and generalize to longer-horizon tasks that are outside the scope of policies with less compute. Building on prior work in algorithmic learning and model-free planning, we propose a minimal architecture that can use a variable amount of compute. Our experiments complement our theory. On a set of 31 different tasks spanning online and offline RL, we show that $(1)$ this architecture achieves stronger performance simply by using more compute, and $(2)$ it achieves stronger generalization on longer-horizon test tasks than standard feedforward networks or deep residual networks using up to 5 times more parameters.

Paper Structure

This paper contains 30 sections, 4 theorems, 11 equations, 7 figures, and 3 tables.

Key Result

Theorem 4.1

Let $g(n)$ and $t(n)$ be time-constructible functions such that $g(n) \in o(t(n) / \log t(n))$. Then there exists an MDP and a function $f(n) \in {\mathcal{O}}(t(n))$ such that:
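The growth condition $g(n) \in o(t(n) / \log t(n))$ in the hypothesis can be checked by hand for concrete functions. The pair $g(n) = n$, $t(n) = n^2$ below is chosen purely for illustration and does not come from the paper; the second line shows a pair for which the same ratio does not vanish, so the condition fails.

```latex
% Illustrative choice (an assumption, not from the paper): g(n) = n, t(n) = n^2.
\[
\lim_{n \to \infty} \frac{g(n)}{t(n) / \log t(n)}
  = \lim_{n \to \infty} \frac{n \cdot \log (n^2)}{n^2}
  = \lim_{n \to \infty} \frac{2 \log n}{n}
  = 0,
\qquad \text{so } g(n) \in o\!\left(t(n) / \log t(n)\right).
\]
% Counterexample: g(n) = n^2 / \log n with the same t(n) = n^2.
\[
\frac{g(n)}{t(n) / \log t(n)}
  = \frac{n^2}{\log n} \cdot \frac{2 \log n}{n^2}
  = 2 \not\to 0,
\qquad \text{so } g(n) \notin o\!\left(t(n) / \log t(n)\right).
\]
```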

Figures (7)

  • Figure 1: Complete recurrent architecture. This figure demonstrates the architecture for training policies/value functions. The recurrent block we use is an IRU.
  • Figure 2: Scaling recurrent steps in discrete environments. Both Boxpick tasks improve as the number of recurrent steps increases, with performance often peaking at five recurrent steps.
  • Figure 3: Scaling up recurrent steps in continuous environments improves performance in OGBench tasks (Park et al., 2025). Interestingly, additional recurrent steps considerably improve performance mainly in tasks that involve long-horizon reasoning (scene, cube, and puzzle), while performance in stitching navigation tasks increases only marginally with more steps.
  • Figure 4: Do recurrent steps improve generalization? Throughout training, we track how policies learned by the IRU architecture and baselines perform on unseen tasks, including those that require more steps to solve. The IRU architecture learns faster (on $5/5$ tasks) and converges to higher asymptotic performance (on $4/5$ tasks) than MLPs and deep ResNets. The gains from IRU are most pronounced on lightsout-4x4, where all other architectures achieve only trivial performance.
  • Figure 5: The Value of Compute. We plot the value of compute (VoC) of IRU-$(5)$ over IRU-$(1)$ for different numbers of steps. The VoC increases with the number of steps taken using less compute, and it peaks in the early half of the episode, where choosing the correct action is both difficult and crucial.
  • ...and 2 more figures
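
The idea behind Figures 1 to 3, a fixed set of parameters applied for a variable number of recurrent steps, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's IRU: the block structure (one tanh layer over the concatenated hidden state and observation), the initialization, and all function names (`make_block`, `iterate_block`, `count_params`, `multiply_adds`) are hypothetical. It only shows that the parameter count stays constant while the compute grows linearly with the number of steps.

```python
import math
import random


def make_block(dim, seed=0):
    """Create the single weight matrix and bias that every recurrent step reuses.

    Hypothetical stand-in for a recurrent block; not the paper's IRU.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    # Each row maps the concatenated [hidden, observation] vector to one hidden unit.
    weights = [[rng.uniform(-scale, scale) for _ in range(2 * dim)] for _ in range(dim)]
    bias = [0.0] * dim
    return weights, bias


def count_params(block):
    """Number of parameters; independent of how many steps are run."""
    weights, bias = block
    return sum(len(row) for row in weights) + len(bias)


def multiply_adds(block, num_steps):
    """Rough compute cost: one multiply-add per weight per recurrent step."""
    weights, _ = block
    return num_steps * sum(len(row) for row in weights)


def iterate_block(block, obs, num_steps):
    """Apply the same block num_steps times, re-injecting the observation each step."""
    weights, bias = block
    hidden = [0.0] * len(bias)
    for _ in range(num_steps):
        inputs = hidden + obs  # list concatenation: hidden state then observation
        hidden = [
            math.tanh(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, bias)
        ]
    return hidden


if __name__ == "__main__":
    block = make_block(dim=4)
    obs = [0.5, -1.0, 0.25, 0.0]
    for steps in (1, 5):
        features = iterate_block(block, obs, steps)
        print(steps, count_params(block), multiply_adds(block, steps), len(features))
```

Running the block with one step and with five steps uses the same parameters; only the loop count, and therefore the multiply-add count, changes. In a full agent the final hidden vector would feed a policy or value head, which is omitted here.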

Theorems & Definitions (8)

  • Definition 1: Time bounded Turing machines
  • Definition 2: Time bounded policies and policy classes
  • Theorem 4.1: Policy Hierarchy Theorem
  • Theorem 4.2: Long Horizon Generalization
  • Theorem 1.1: Policy Hierarchy Theorem
  • proof
  • Theorem 1.2: Long Horizon Generalization
  • proof