Oracle-Efficient Reinforcement Learning for Max Value Ensembles

Marcel Hussing; Michael Kearns; Aaron Roth; Sikata Bela Sengupta; Jessica Sorrell

TL;DR

The paper tackles reinforcement learning in large state spaces by leveraging an ensemble of $K$ constituent policies and aiming to compete with the max-following benchmark. It presents MaxIteration, an oracle-efficient algorithm that uses a squared-error regression oracle for the constituent value functions, queried on samplable distributions, to iteratively build a max-following-competitive policy over a horizon $H$. The authors prove guarantees with $O(HK)$ oracle calls, introducing an approximate max-following benchmark and bounding the probability of bad trajectories so that the learned policy is within $O(\varepsilon)$ of that benchmark, which in turn competes with the best constituent policy. Empirical results on CompoSuite and DM Control show that MaxIteration can outperform individual constituents and is robust to limited data budgets, indicating practical potential for scalable policy improvement from existing skills. Overall, the work provides a principled, regression-oracle-based path to ensemble-based RL that scales to large state spaces while maintaining competitive performance.

Abstract

Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of heuristic base or $\textit{constituent}$ policies upon which we would like to improve in a scalable manner. In this work we aim to compete with the $\textit{max-following policy}$, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.
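To make the max-following rule concrete, the following is a minimal sketch on a hypothetical three-state, two-step deterministic MDP. The MDP, the two constituent policies, and all names (`next_state`, `reward`, `policy_values`, `max_following_return`) are illustrative assumptions and do not come from the paper. The sketch also computes exact constituent value functions by backward induction, whereas the paper's setting only assumes oracle access to approximate values.

```python
import numpy as np

H = 2  # episode length (assumed for this toy example)

# Hypothetical deterministic MDP: next_state[s, a] and reward[s, a].
next_state = np.array([[1, 2], [1, 1], [2, 2]])
reward = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.5]])

# Two constituent policies: one always plays action 0, the other action 1.
policies = [lambda h, s: 0, lambda h, s: 1]


def policy_values(policy):
    """Exact finite-horizon values V[h, s] of a deterministic policy."""
    n_states = reward.shape[0]
    V = np.zeros((H + 1, n_states))
    for h in range(H - 1, -1, -1):  # backward induction over time steps
        for s in range(n_states):
            a = policy(h, s)
            V[h, s] = reward[s, a] + V[h + 1, next_state[s, a]]
    return V


values = [policy_values(pi) for pi in policies]


def max_following_return(start_state):
    """Roll out the policy that follows the highest-valued constituent."""
    s, total = start_state, 0.0
    for h in range(H):
        k = int(np.argmax([V[h, s] for V in values]))  # best constituent here
        a = policies[k](h, s)
        total += reward[s, a]
        s = next_state[s, a]
    return total


print("constituent returns:", [float(V[0, 0]) for V in values])
print("max-following return:", float(max_following_return(0)))
```

In this toy instance the constituents earn returns of 1.0 and 0.5 from state 0, while max-following earns 2.0 by switching from the first constituent to the second after one step, which illustrates how the benchmark can be considerably better than the best constituent policy.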

Paper Structure (18 sections, 4 theorems, 22 equations, 5 figures, 2 tables, 1 algorithm)

Introduction
Results
Related work
Preliminaries
The $\mathsf{MaxIteration}$ learning algorithm
The approximate max-following benchmark
Experiments
Experimental Results
Conclusion
Limitations and Future Work
MDP Examples
LQR max-following parametric class vs. constituent policies
Additional Proofs
Additional information about experiments
Hyperparameters
...and 3 more sections

Key Result

Theorem 3.1

For any $\varepsilon \in (0,1]$, any MDP $\mathcal{M}$ with starting state distribution $\mu_0$, any episode length $H$, and any $K$ policies $\Pi^k$ defined on $\mathcal{M}$, let $\alpha \in \Theta(\tfrac{\varepsilon^3}{KH^4})$ and $\beta \in \Theta(\tfrac{\varepsilon}{H})$. Then $\mathsf{MaxIterat

Figures (5)

Figure 1: Examples of MDPs with max-following policy performance comparison
Figure 2: Examples for Observation \ref{obs:approx-tie-breaking} and Observation \ref{obs:value-class}
Figure 3: Mean cumulative return and success over $5$ seeds of $\mathsf{MaxIteration}$ compared to fine-tuning IQL on selected tasks. Error bars correspond to standard error. Full bars correspond to returns and red lines indicate the success rate of each algorithm. $\mathsf{MaxIteration}$ can yield improvements in return, but increased return does not always yield success.
Figure 4: Mean cumulative return and success over $5$ seeds of $\mathsf{MaxIteration}$ compared to fine-tuning IQL on all considered tasks. Error bars correspond to standard error. Full bars correspond to returns and red lines indicate the success rate of each algorithm.
Figure 5: Mean cumulative return over $5$ seeds of $\mathsf{MaxIteration}$ on DM Control tasks (Tunyasuvunakool et al., 2020). Error bars correspond to standard error. $\mathsf{MaxIteration}$ always selects the best-performing constituent policy.

Theorems & Definitions (9)

Definition 2.1: Max-following policy class
Definition 2.2: Oracle for $\pi$ value function estimates
Definition 2.3: Approximate max-following policies
Theorem 3.1
proof
Lemma 4.0: Worst approximate max-following policy competes with best fixed policy
Corollary 4.1
Lemma B.0: Worst approximate max-following policy competes with best fixed policy
proof
