Sample-Efficiency in Multi-Batch Reinforcement Learning: The Need for Dimension-Dependent Adaptivity
Emmeran Johnson, Ciara Pike-Burke, Patrick Rebeschini
TL;DR
The paper investigates the trade-off between adaptivity and sample-efficiency in reinforcement learning under a multi-batch data-collection model, focusing on infinite-horizon discounted MDPs with linear function approximation. It proves an $Ω(\log \log d)$ lower bound on the number of batches needed for sample-efficient policy evaluation (PE) and best-policy identification (BPI), showing that mere adaptivity ($K>1$) is insufficient and that the adaptivity boundary for sample-efficiency depends on the dimension $d$. The authors extend the offline (batch RL) framework of Zanette to the multi-batch setting and employ subspace-packing and hyper-spherical sector techniques to construct hard MDP instances in which information can be erased along multiple dimensions across batches. They provide both policy-induced and policy-free lower bounds, discuss implications for low-adaptivity algorithm design, and outline open questions about possible upper bounds and the tightness of the $Ω(\log \log d)$ dependence. Overall, the work clarifies how adaptivity and dimension interact to govern sample-efficiency in RL, which informs the design of low-adaptivity algorithms in high-dimensional settings.
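To give a feel for how slowly a $\log \log d$ quantity grows, the following minimal sketch evaluates $\log_2 \log_2 d$ for a few dimensions. The helper name `log_log2` and the choice of base 2 are assumptions made for illustration only; the $Ω(\cdot)$ notation hides constants, so these values are not the actual number of batches required.

```python
import math


def log_log2(d: int) -> float:
    """Return log2(log2(d)); illustrative only, since Omega hides constants."""
    if d < 2:
        raise ValueError("d must be at least 2 so that log2(d) > 0")
    return math.log2(math.log2(d))


# Squaring the dimension adds only one to log2(log2(d)).
for d in (4, 16, 256, 65536):
    print(d, log_log2(d))
```

Each squaring of $d$ increases the value by exactly one, which suggests why the bound is only felt in high-dimensional settings.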
Abstract
We theoretically explore the relationship between sample-efficiency and adaptivity in reinforcement learning. An algorithm is sample-efficient if it uses a number of queries $n$ to the environment that is polynomial in the dimension $d$ of the problem. Adaptivity refers to the frequency at which queries are sent and feedback is processed to update the querying strategy. To investigate this interplay, we employ a learning framework that allows sending queries in $K$ batches, with feedback being processed and queries updated after each batch. This model encompasses the whole adaptivity spectrum, ranging from non-adaptive 'offline' ($K=1$) to fully adaptive ($K=n$) scenarios, and regimes in between. For the problems of policy evaluation and best-policy identification under $d$-dimensional linear function approximation, we establish $Ω(\log \log d)$ lower bounds on the number of batches $K$ required for sample-efficient algorithms with $n = O(\mathrm{poly}(d))$ queries. Our results show that just having adaptivity ($K>1$) does not necessarily guarantee sample-efficiency. Notably, the adaptivity-boundary for sample-efficiency is not between offline reinforcement learning ($K=1$), where sample-efficiency was known to be impossible, and adaptive settings. Instead, the boundary lies between different regimes of adaptivity and depends on the problem dimension.
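The $K$-batch query model described above can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the paper's formal definition: the names `batch_sizes`, `run_multi_batch`, `plan_batch`, and `oracle` are hypothetical, queries are modelled as (state, action) pairs, and batches are taken to be of near-equal size, which the abstract does not specify.

```python
from typing import Callable, List, Tuple

Query = Tuple[int, int]        # hypothetical (state, action) pair
Feedback = Tuple[float, int]   # hypothetical (reward, next state)
History = List[Tuple[Query, Feedback]]


def batch_sizes(n: int, K: int) -> List[int]:
    """Split n queries into K batches whose sizes differ by at most one.

    Near-equal batch sizes are an assumption made for this sketch.
    """
    if not 1 <= K <= n:
        raise ValueError("need 1 <= K <= n")
    base, extra = divmod(n, K)
    return [base + (1 if k < extra else 0) for k in range(K)]


def run_multi_batch(
    n: int,
    K: int,
    plan_batch: Callable[[History, int], List[Query]],
    oracle: Callable[[Query], Feedback],
) -> History:
    """Collect n queries in K batches.

    Every query in a batch is chosen before any feedback from that batch
    is observed; only feedback from earlier batches may be used.
    """
    history: History = []
    for size in batch_sizes(n, K):
        queries = plan_batch(list(history), size)
        if len(queries) != size:
            raise ValueError("plan_batch must return exactly `size` queries")
        feedback = [oracle(q) for q in queries]
        history.extend(zip(queries, feedback))
    return history
```

With `K=1` the planner is called once on an empty history, which corresponds to the offline regime; with `K=n` it is called before every single query, which corresponds to the fully adaptive regime.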
