Unified Algorithms for RL with Decision-Estimation Coefficients: PAC, Reward-Free, Preference-Based Learning, and Beyond

Fan Chen; Song Mei; Yu Bai

TL;DR

This work proposes a unified DEC-based framework for reinforcement learning that extends beyond traditional no-regret objectives to encompass PAC learning, reward-free exploration, all-policy model estimation, and preference-based RL. Central to the framework are the generalized Decision-Estimation Coefficient (G-DEC) and a corresponding meta-algorithm (G-E2D), which yield goal-specific bounds by tuning a divergence-based trade-off between information gain and suboptimality. The paper provides PAC and regret guarantees, lower bounds, and concrete instantiations (RFDEC, AMDEC, PBDEC), with the DECs bounded via decouplable representations, a condition that unifies several existing structural conditions. It also connects these DEC-based guarantees to the optimistic model-based methods MOPS and OMLE, showing that they admit similar sample complexity bounds under similar structural assumptions. The results recover prior findings and yield new guarantees across a wide range of learning goals and model classes, including reward-free exploration, partially observable RL, and multi-agent settings, while leaving computational efficiency and potential gaps between the upper and lower bounds as open directions.
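For orientation, the trade-off mentioned above can be made concrete with the classical DEC of Foster et al. (2021), which the generalized coefficient builds on. The display below is that classical form with a single fixed reference model $\overline{M}$; it is given only as a reminder and is not the paper's own definition, whose randomized reference models and goal-specific variants differ in details not reproduced here. The notation ($\Pi$ for the decision space, $f^{M}$ for the value function under model $M$, $\pi_{M}$ for the optimal decision under $M$, $D_{\mathrm{H}}$ for the Hellinger distance) is assumed for this sketch.

```latex
\mathrm{dec}_{\gamma}(\mathcal{M}, \overline{M})
  = \inf_{p \in \Delta(\Pi)} \; \sup_{M \in \mathcal{M}} \;
    \mathbb{E}_{\pi \sim p}\Big[
      \underbrace{f^{M}(\pi_{M}) - f^{M}(\pi)}_{\text{suboptimality}}
      \; - \; \gamma \,
      \underbrace{D_{\mathrm{H}}^{2}\big(M(\pi), \overline{M}(\pi)\big)}_{\text{information gain}}
    \Big]
```

A larger $\gamma$ rewards decisions that distinguish $M$ from $\overline{M}$, so the coefficient is non-increasing in $\gamma$.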

Abstract

Modern Reinforcement Learning (RL) is more than just learning the optimal policy; alternative learning goals such as exploring the environment, estimating the underlying model, and learning from preference feedback are all of practical importance. While provably sample-efficient algorithms for each specific goal have been proposed, these algorithms often depend strongly on the particular learning goal and thus have correspondingly different structures. It is a pressing open question whether these learning goals can instead be tackled by a single unified algorithm. We make progress on this question by developing a unified algorithmic framework for a large class of learning goals, building on the Decision-Estimation Coefficient (DEC) framework. Our framework handles many learning goals such as no-regret RL, PAC RL, reward-free learning, model estimation, and preference-based learning, all by simply instantiating the same generic complexity measure, called "Generalized DEC", and a corresponding generic algorithm. The generalized DEC also yields a sample complexity lower bound for each specific learning goal. As applications, we propose "decouplable representation" as a natural sufficient condition for bounding generalized DECs, and use it to obtain many new sample-efficient results (and recover existing results) for a wide range of learning goals and problem classes as direct corollaries. Finally, as a connection, we re-analyze two existing optimistic model-based algorithms based on Posterior Sampling and Maximum Likelihood Estimation, showing that they enjoy sample complexity bounds under similar structural conditions as the DEC.
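To give a feel for what a DEC-style complexity measure computes, here is a minimal numerical sketch. It evaluates the classical regret DEC (a min over decision distributions of a max over models of suboptimality minus γ times squared Hellinger distance to a reference model) on a toy class of two-armed Bernoulli bandits by grid search. The toy class, the function names, and the grid-search approach are all assumptions made for illustration; this is not the paper's generalized DEC or its algorithm.

```python
import numpy as np

def hellinger_sq_bernoulli(a, b):
    """Squared Hellinger distance between Bernoulli(a) and Bernoulli(b).

    Uses the convention without the 1/2 factor, so values lie in [0, 2].
    """
    return (np.sqrt(a) - np.sqrt(b)) ** 2 + (np.sqrt(1 - a) - np.sqrt(1 - b)) ** 2

def dec_two_arms(models, ref, gamma, grid=2001):
    """Grid-search approximation of the classical DEC for two-armed bandits.

    models: (n_models, 2) array of arm means, the (hypothetical) model class.
    ref:    length-2 array of arm means of the reference model.
    Returns the approximate DEC value and the minimizing arm distribution.
    """
    models = np.asarray(models, dtype=float)
    ref = np.asarray(ref, dtype=float)
    # Suboptimality of each arm under each model: f^M(pi_M) - f^M(pi).
    gaps = models.max(axis=1, keepdims=True) - models
    # Information gain of each arm: D_H^2(M(pi), ref(pi)).
    info = hellinger_sq_bernoulli(models, ref[None, :])
    payoff = gaps - gamma * info                      # shape (n_models, 2)
    # Candidate distributions p = (q, 1 - q) over the two arms.
    q = np.linspace(0.0, 1.0, grid)
    probs = np.stack([q, 1.0 - q], axis=1)            # shape (grid, 2)
    worst_case = (probs @ payoff.T).max(axis=1)       # sup over models
    best = worst_case.argmin()                        # inf over p (on the grid)
    return worst_case[best], probs[best]

if __name__ == "__main__":
    toy_models = [[0.6, 0.4], [0.4, 0.6], [0.5, 0.5]]
    toy_ref = [0.5, 0.5]
    for g in (1.0, 10.0, 100.0):
        value, p = dec_two_arms(toy_models, toy_ref, g)
        print(f"gamma={g}: dec ~ {value:.4f}, p = {p}")
```

Because the reference model belongs to the toy class here, the computed value is never negative, and it can only shrink as γ grows.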

Paper Structure (165 sections, 70 theorems, 431 equations, 2 figures, 2 tables, 12 algorithms)

Introduction
Related work
Sample-efficient reinforcement learning
Decision-estimation coefficient
Other general algorithms
Reward-free learning, model estimation, and preference-based RL
Other problems covered by DMSO
Concurrent work
Subsequent works
Preliminaries
RL as Decision Making with Structured Observations
Learning goals
Divergences
DEC with randomized reference models
No-regret algorithm: E2D with Tempered Aggregation
...and 150 more sections
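
The outline above lists a no-regret algorithm named "E2D with Tempered Aggregation". As a rough illustration of what a tempered aggregation step over a finite model class can look like, the sketch below reweights each model by its observation likelihood raised to a power `eta_p` below 1 and by an exponential penalty `eta_r` on its squared reward error. The exact update form, the function name, and the argument layout are assumptions made for this sketch rather than a transcription of the paper's algorithm; the default learning rates of 1/3 match the values quoted in the Key Result.

```python
import numpy as np

def tempered_aggregation_step(weights, log_liks, sq_reward_errs,
                              eta_p=1.0 / 3.0, eta_r=1.0 / 3.0):
    """One tempered multiplicative-weights update over a finite model class.

    weights:        current distribution over models (positive, sums to 1).
    log_liks:       log-likelihood of the new observation under each model.
    sq_reward_errs: squared error between the observed reward and each
                    model's predicted reward.
    Assumed update: w_new(M) proportional to
        w(M) * exp(eta_p * log_lik(M) - eta_r * sq_reward_err(M)).
    """
    weights = np.asarray(weights, dtype=float)
    log_w = (np.log(weights)
             + eta_p * np.asarray(log_liks, dtype=float)
             - eta_r * np.asarray(sq_reward_errs, dtype=float))
    log_w -= log_w.max()          # shift for numerical stability
    new_weights = np.exp(log_w)
    return new_weights / new_weights.sum()

if __name__ == "__main__":
    prior = np.full(3, 1.0 / 3.0)
    # Hypothetical likelihoods and reward errors for three candidate models.
    posterior = tempered_aggregation_step(
        prior,
        log_liks=np.log([0.5, 0.1, 0.5]),
        sq_reward_errs=[0.0, 0.0, 1.0],
    )
    print(posterior)
```

Using an exponent below 1 on the likelihood makes the update more conservative than an exact Bayesian posterior, which is the usual motivation for tempering.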

Key Result

Proposition 2.2

Choosing $\eta_{\mathrm{p}}=\eta_{\mathrm{r}}=1/3$, Algorithm E2D-TA achieves the following with probability at least $1-\delta$:

Figures (2)

Figure 1: A conceptual diagram of implications between various $\mathsf{G}$-DECs and (strong) decouplable representation. PACDEC can be bounded by RFDEC, which can be further bounded by AMDEC; PACDEC and Regret DEC can be converted to each other; Regret DEC can be bounded by PBDEC. The implications between $\mathsf{G}$-DECs are discussed in the corresponding sections (cf. \ref{sec:gen-dec}), and the bounds on $\mathsf{G}$-DECs in terms of (strong) decouplable representation are presented in \ref{section:bellman-rep} (\ref{prop:belrep-pac}, \ref{prop:belrep-am}, \ref{prop:belrep-rf}, and \ref{prop:belrep-pb}).
Figure 2: Illustration of how the decouplable representation recovers existing generic structural conditions, including the model-based version of Bilinear class \citep{du2021bilinear}, Bellman-Eluder dimension \citep{jin2021bellman}, and stable PSR \citep{chen2022partially}. As we discuss in \ref{section:bellman-rep}, strong decouplable representation also encompasses various concrete MDP model classes (see e.g. \ref{tab:examples}).

Theorems & Definitions (132)

Definition 2.1: DEC with randomized reference models
Proposition 2.2: Regret guarantee for E2D-TA
Definition 3.1: PACDEC
Theorem 3.2: PAC RL with PAC E2D
Proposition 3.3: Lower bound for PAC RL
Proposition 3.4: Relationship between PACDEC and Regret DEC
Proposition 3.5: Informal
Definition 4.1
Theorem 4.2
Theorem 4.3
...and 122 more
