Sample-Efficiency in Multi-Batch Reinforcement Learning: The Need for Dimension-Dependent Adaptivity

Emmeran Johnson; Ciara Pike-Burke; Patrick Rebeschini

TL;DR

The paper investigates the trade-off between adaptivity and sample-efficiency in reinforcement learning under a multi-batch data-collection model, focusing on infinite-horizon discounted MDPs with linear function approximation. It proves an $\Omega(\log \log d)$ lower bound on the number of batches $K$ needed for sample-efficient learning in both policy evaluation (PE) and best-policy identification (BPI), showing that mere adaptivity ($K>1$) is insufficient and that the boundary scales with the dimension $d$. The authors extend the offline framework of Zanette_BatchRL to the multi-batch setting and use subspace-packing and hyperspherical-sector techniques to construct hard MDP instances in which information can be erased along multiple dimensions across batches. They give lower bounds for both policy-induced and policy-free queries, discuss implications for low-adaptivity algorithm design, and outline open questions about possible upper bounds and the tightness of the $\Omega(\log \log d)$ dependence. Overall, the work shows that the adaptivity needed for sample-efficiency depends on the problem dimension, which informs the design of low-adaptivity algorithms in high-dimensional settings.

Abstract

We theoretically explore the relationship between sample-efficiency and adaptivity in reinforcement learning. An algorithm is sample-efficient if it uses a number of queries $n$ to the environment that is polynomial in the dimension $d$ of the problem. Adaptivity refers to the frequency at which queries are sent and feedback is processed to update the querying strategy. To investigate this interplay, we employ a learning framework that allows sending queries in $K$ batches, with feedback being processed and queries updated after each batch. This model encompasses the whole adaptivity spectrum, ranging from non-adaptive 'offline' ($K=1$) to fully adaptive ($K=n$) scenarios, and regimes in between. For the problems of policy evaluation and best-policy identification under $d$-dimensional linear function approximation, we establish $\Omega(\log \log d)$ lower bounds on the number of batches $K$ required for sample-efficient algorithms with $n = O(\mathrm{poly}(d))$ queries. Our results show that just having adaptivity ($K>1$) does not necessarily guarantee sample-efficiency. Notably, the adaptivity-boundary for sample-efficiency is not between offline reinforcement learning ($K=1$), where sample-efficiency was known to not be possible, and adaptive settings. Instead, the boundary lies between different regimes of adaptivity and depends on the problem dimension.
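The multi-batch model described in the abstract can be sketched as a simple loop. This is a minimal illustration, not the paper's formal definition: the even split of $n$ queries across $K$ batches, and the names `batch_sizes`, `run_multi_batch`, `choose_queries`, `environment`, and `loglog` are all assumptions made for this sketch. The key property it shows is that queries within a batch are chosen before any feedback from that batch is seen.

```python
import math


def batch_sizes(n, K):
    """Split n queries into K batches as evenly as possible (an assumed split)."""
    base, extra = divmod(n, K)
    return [base + (1 if i < extra else 0) for i in range(K)]


def run_multi_batch(n, K, choose_queries, environment):
    """Generic K-batch loop.

    choose_queries(history, size) is a hypothetical learner callback that may
    look only at feedback from earlier batches. environment(query) is a
    hypothetical callback returning the feedback for one query.
    """
    history = []
    for size in batch_sizes(n, K):
        queries = choose_queries(history, size)       # fixed before feedback arrives
        feedback = [environment(q) for q in queries]  # revealed only after the batch
        history.append(list(zip(queries, feedback)))
    return history


def loglog(d):
    """log log d, the growth rate in the lower bound (hidden constants ignored)."""
    return math.log(math.log(d))
```

With `K = 1` the loop reduces to the offline setting (all queries chosen up front), and with `K = n` every query can react to all previous feedback. `loglog(d)` grows very slowly, so the number of batches the lower bound requires increases with $d$, but only mildly.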

Paper Structure (46 sections, 10 theorems, 107 equations, 4 figures, 1 algorithm)

Introduction
Preliminaries
Problem Setting
Policy Evaluation (PE)
Best Policy Identification (BPI)
Multi-Batch Learning Model
Main Results
Policy-Induced Queries
Policy-Free Queries
Discussion
Related Works
Proof Sketch
Conclusion
Further Related Works
Bounds for the Fully Adaptive Setting
...and 31 more sections

Key Result

Theorem 4.4

Fix $d$ sufficiently large. There exists a class of MDPs $\mathcal{M}$ and policies $\Pi$ defining PE problems $(\Bar{s}, M, \mathcal{M}, \pi_M, \Pi)$ satisfying the Realizability assumption such that any sample-efficient learner better than $(1,1/2)$-sound using policy-induced or policy-free

Figures (4)

Figure 1: Left: Information can be erased in multiple directions: Consider the setting where information is being erased along the pink plane $\mathcal{N}$: the learner's queries $\phi(s_i, a_i)$ are shown in blue and the environment's responses $-\gamma \phi(s_i^+, \pi_M(s_i^+))$ are shown in black. The rows of $X$, $\phi(s_i, a_i) -\gamma \phi(s_i^+, \pi_M(s_i^+))$ (blue + black vectors) all lie on the yellow line so the learner acquires no information in the directions of the pink subspace $\mathcal{N}$, the null-space of $X$. Right: Information cannot be erased in all directions: Consider the opposite setting where information is being erased along the pink line $\mathcal{N}$. Because of the constraint $\|\phi(s,a)\|_2\leq 1$ and $\gamma < 1$, a query $\phi_1$ (in blue) in the pink cap cannot have $\phi_1 - \gamma \phi^+_1$ (blue + black vector) projected back onto the yellow plane (unless $\|\gamma \phi_1\|_2 > \gamma \implies \|\phi_1\|_2 > 1$). Despite the environment not being able to erase information in certain directions, if the number of queries is "small", it can always find directions to erase.
Figure 2: Illustration of a hyperspherical sector of a $1$-dimensional subspace (left) and a $2$-dimensional subspace (right - all vectors whose direction is within the two pink bands). In both cases, the subspace $H$ is in yellow.
Figure 3: Round 1.
Figure 4: Round 2.
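The geometry in the left panel of Figure 1 can be checked numerically on a toy example. This is a sketch under stated assumptions, not the paper's construction: the three-dimensional vectors, the discount `gamma = 0.5`, and the premise that the learner's observations depend on the parameter `theta` only through $X\theta$ are all chosen for illustration. Each row of `X` is $\phi(s_i, a_i) - \gamma \phi(s_i^+, \pi_M(s_i^+))$, and every row is made to lie on the third coordinate axis (the "line"), so the plane spanned by the first two axes is the null-space $\mathcal{N}$.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length tuples."""
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    """Euclidean norm."""
    return dot(u, u) ** 0.5


gamma = 0.5  # assumed discount factor, chosen so the arithmetic is exact

# Hypothetical queries phi(s_i, a_i) and successor features phi(s_i^+, pi(s_i^+)).
queries = [(0.25, 0.0, 0.5), (0.0, 0.25, -0.5)]
successors = [(0.5, 0.0, 0.0), (0.0, 0.5, 0.0)]

# All feature vectors respect the constraint ||phi||_2 <= 1.
assert all(norm(v) <= 1.0 for v in queries + successors)

# Rows of X: phi - gamma * phi_plus. Here both rows lie on the third axis.
X = [
    tuple(p - gamma * q for p, q in zip(phi, phi_plus))
    for phi, phi_plus in zip(queries, successors)
]

theta_a = (1.0, 2.0, 3.0)
shift = (5.0, -7.0, 0.0)  # a direction inside the null-space of X
theta_b = tuple(t + s for t, s in zip(theta_a, shift))

obs_a = [dot(row, theta_a) for row in X]
obs_b = [dot(row, theta_b) for row in X]
```

Since `obs_a` and `obs_b` coincide, these two queries cannot distinguish `theta_a` from `theta_b`: nothing is learned about the component of the parameter inside the null-space.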

Theorems & Definitions (21)

Definition 3.1
Definition 3.2
Definition 3.3: Query-Feedback
Remark 3.4
Definition 3.5: Policy-Induced Queries (Zanette_BatchRL, Definition 2)
Remark 3.6
Definition 4.3
Theorem 4.4
Theorem 4.5
Theorem B.2
...and 11 more
