
Sample-Efficiency in Multi-Batch Reinforcement Learning: The Need for Dimension-Dependent Adaptivity

Emmeran Johnson, Ciara Pike-Burke, Patrick Rebeschini

TL;DR

The paper investigates the trade-off between adaptivity and sample-efficiency in reinforcement learning under a multi-batch data-collection model, focusing on infinite-horizon discounted MDPs with linear function approximation. It proves an $\Omega(\log \log d)$ lower bound on the number of batches $K$ needed for sample-efficient learning in both policy evaluation (PE) and best-policy identification (BPI), showing that mere adaptivity ($K>1$) is insufficient and that the boundary for sample-efficiency depends on the dimension $d$. The authors extend the offline framework of Zanette_BatchRL to the multi-batch setting and use subspace-packing and hyperspherical-sector techniques to construct hard MDP instances in which information can be erased along multiple dimensions across batches. They provide lower bounds for both policy-induced and policy-free queries, discuss implications for low-adaptivity algorithm design, and outline open questions about possible upper bounds and the tightness of the $\Omega(\log \log d)$ dependence. Overall, the work clarifies how adaptivity and dimension interact to govern sample-efficiency in RL, which informs the design of low-adaptivity algorithms in high-dimensional settings.
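
To get a feel for how slowly a $\log \log d$ threshold grows, here is a rough numerical illustration. It assumes base-2 logarithms and ignores the constant hidden in the $\Omega(\cdot)$ notation, neither of which is specified in this summary, so the numbers are indicative only:

$$\log_2 \log_2 2^{16} = \log_2 16 = 4, \qquad \log_2 \log_2 2^{1024} = \log_2 1024 = 10.$$

The threshold is unbounded in $d$, so no fixed number of batches suffices for every dimension, yet it grows very slowly.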

Abstract

We theoretically explore the relationship between sample-efficiency and adaptivity in reinforcement learning. An algorithm is sample-efficient if it uses a number of queries $n$ to the environment that is polynomial in the dimension $d$ of the problem. Adaptivity refers to the frequency at which queries are sent and feedback is processed to update the querying strategy. To investigate this interplay, we employ a learning framework that allows sending queries in $K$ batches, with feedback being processed and queries updated after each batch. This model encompasses the whole adaptivity spectrum, ranging from non-adaptive 'offline' ($K=1$) to fully adaptive ($K=n$) scenarios, and regimes in between. For the problems of policy evaluation and best-policy identification under $d$-dimensional linear function approximation, we establish $\Omega(\log \log d)$ lower bounds on the number of batches $K$ required for sample-efficient algorithms with $n = O(\mathrm{poly}(d))$ queries. Our results show that just having adaptivity ($K>1$) does not necessarily guarantee sample-efficiency. Notably, the adaptivity-boundary for sample-efficiency is not between offline reinforcement learning ($K=1$), where sample-efficiency was known to not be possible, and adaptive settings. Instead, the boundary lies between different regimes of adaptivity and depends on the problem dimension.
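
The batched query model described in the abstract can be sketched as a short loop. This is a minimal illustration and not the paper's formal definition: the names `run_multi_batch`, `choose_batch`, and `environment` are hypothetical, queries are modelled as plain (state, action) pairs, and the near-equal batch sizes are an assumption made for simplicity.

```python
from typing import Callable, List, Tuple

Query = Tuple[int, int]        # hypothetical (state, action) pair
Feedback = Tuple[float, int]   # hypothetical (reward, next state)
Batch = List[Tuple[Query, Feedback]]


def run_multi_batch(
    n: int,
    K: int,
    choose_batch: Callable[[List[Batch], int], List[Query]],
    environment: Callable[[Query], Feedback],
) -> List[Batch]:
    """Send n queries in K batches; feedback is processed only between batches."""
    if not 1 <= K <= n:
        raise ValueError("need 1 <= K <= n")
    # Assumption: batches are as equal in size as possible.
    sizes = [n // K + (1 if k < n % K else 0) for k in range(K)]
    history: List[Batch] = []
    for size in sizes:
        # The whole batch is fixed using feedback from earlier batches only.
        batch = choose_batch(history, size)
        if len(batch) != size:
            raise ValueError("choose_batch returned the wrong number of queries")
        feedback = [environment(q) for q in batch]
        history.append(list(zip(batch, feedback)))
    return history
```

With `K=1` the learner commits to all `n` queries before seeing any feedback (the offline case), while `K=n` lets it react after every single query (the fully adaptive case).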

Paper Structure (46 sections, 10 theorems, 107 equations, 4 figures, 1 algorithm)


Key Result

Theorem 4.4

Fix $d$ sufficiently large. There exists a class of MDPs $\mathcal{M}$ and policies $\Pi$ defining PE problems $(\bar{s}, M, \mathcal{M}, \pi_M, \Pi)$ satisfying Assumption (Realizability) such that any sample-efficient learner better than $(1,1/2)$-sound using policy-induced or policy-free

Figures (4)

  • Figure 1: Left: Information can be erased in multiple directions: Consider the setting where information is being erased along the pink plane $\mathcal{N}$: the learner's queries $\phi(s_i, a_i)$ are shown in blue and the environment's responses $-\gamma \phi(s_i^+, \pi_M(s_i^+))$ are shown in black. The rows of $X$, $\phi(s_i, a_i) -\gamma \phi(s_i^+, \pi_M(s_i^+))$ (blue + black vectors) all lie on the yellow line so the learner acquires no information in the directions of the pink subspace $\mathcal{N}$, the null-space of $X$. Right: Information cannot be erased in all directions: Consider the opposite setting where information is being erased along the pink line $\mathcal{N}$. Because of the constraint $\|\phi(s,a)\|_2\leq 1$ and $\gamma < 1$, a query $\phi_1$ (in blue) in the pink cap cannot have $\phi_1 - \gamma \phi^+_1$ (blue + black vector) projected back onto the yellow plane (unless $\|\gamma \phi_1\|_2 > \gamma \implies \|\phi_1\|_2 > 1$). Despite the environment not being able to erase information in certain directions, if the number of queries is "small", it can always find directions to erase.
  • Figure 2: Illustration of a hyperspherical sector of a $1$-dimensional subspace (left) and a $2$-dimensional subspace (right - all vectors whose direction is within the two pink bands). In both cases, the subspace $H$ is in yellow.
  • Figure 3: Round 1.
  • Figure 4: Round 2.
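
The left panel of Figure 1 can be reproduced numerically in a toy setting. The sketch below is a linear-algebra illustration only, not the paper's hard-instance construction. It assumes the environment may pick any response vector in the unit ball (the array `phi_plus` stands in for $\phi(s_i^+, \pi_M(s_i^+))$), and it restricts queries to norm at most $\gamma$ so that erasure is always feasible, which sidesteps the obstruction shown in the right panel. The helper `erase_off_line` and the vectors `theta_a`, `theta_b` are hypothetical names.

```python
import numpy as np


def erase_off_line(phi: np.ndarray, u: np.ndarray, gamma: float) -> np.ndarray:
    """Return phi_plus such that phi - gamma * phi_plus is parallel to u."""
    u = u / np.linalg.norm(u)
    residual = phi - np.dot(phi, u) * u  # part of phi orthogonal to u
    return residual / gamma


rng = np.random.default_rng(0)
d, gamma, n = 3, 0.9, 5
u = np.array([1.0, 0.0, 0.0])  # the "yellow line" direction

# Assumption: queries have norm <= gamma, so every phi_plus stays in the unit ball.
raw = rng.normal(size=(n, d))
phis = gamma * raw / np.maximum(1.0, np.linalg.norm(raw, axis=1, keepdims=True))

phi_plus = np.array([erase_off_line(p, u, gamma) for p in phis])
X = phis - gamma * phi_plus  # rows are phi_i - gamma * phi_i^+

# Two parameter vectors that differ only inside the null-space of X.
theta_a = np.array([0.3, 0.2, -0.1])
theta_b = theta_a + np.array([0.0, 1.0, 1.0])

print("rank of X:", np.linalg.matrix_rank(X, tol=1e-9))
print("same feedback:", np.allclose(X @ theta_a, X @ theta_b))
```

Every row of `X` lies on the line spanned by `u`, so the null-space of `X` is the plane orthogonal to `u`, and the two parameter vectors produce identical values of `X @ theta`.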

Theorems & Definitions (21)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3: Query-Feedback
  • Remark 3.4
  • Definition 3.5: Policy-Induced Queries (Zanette_BatchRL, Definition 2)
  • Remark 3.6
  • Definition 4.3
  • Theorem 4.4
  • Theorem 4.5
  • Theorem B.2
  • ...and 11 more