An Improved Model-Free Decision-Estimation Coefficient with Applications in Adversarial MDPs

Haolin Liu, Chen-Yu Wei, Julian Zimmert

TL;DR

This paper introduces Dig-DEC, a model-free decision-estimation coefficient that drives exploration by information gain rather than optimism, which lets it handle adversarial environments. By embedding Dig-DEC in a general $\Phi$-restricted AIR framework with a flexible PosteriorUpdate step, the authors obtain regret bounds for model-free learning in both stochastic and hybrid MDPs, including the first such bounds for hybrid MDPs with bandit feedback. They also refine online function estimation: sharper concentration for average estimation error improves the regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy), and a redesigned procedure for squared estimation error improves the bound from $T^{2/3}$ to $\sqrt{T}$ in Bellman-complete MDPs, matching optimism-based methods in that setting. The resulting bounds apply across bilinear, Bellman-eluder, and coverable MDP classes, and Dig-DEC handles adversarially changing rewards without requiring reward estimators.

Abstract

We study decision making with structured observation (DMSO). Previous work (Foster et al., 2021b, 2023a) has characterized the complexity of DMSO via the decision-estimation coefficient (DEC), but left a gap between the regret upper and lower bounds that scales with the size of the model class. To tighten this gap, Foster et al. (2023b) introduced optimistic DEC, achieving a bound that scales only with the size of the value-function class. However, their optimism-based exploration is only known to handle the stochastic setting, and it remains unclear whether it extends to the adversarial setting. We introduce Dig-DEC, a model-free DEC that removes optimism and drives exploration purely by information gain. Dig-DEC is always no larger than optimistic DEC and can be much smaller in special cases. Importantly, the removal of optimism allows it to handle adversarial environments without explicit reward estimators. By applying Dig-DEC to hybrid MDPs with stochastic transitions and adversarial rewards, we obtain the first model-free regret bounds for hybrid MDPs with bandit feedback under several general transition structures, resolving the main open problem left by Liu et al. (2025). We also improve the online function-estimation procedure in model-free learning: For average estimation error minimization, we refine the estimator in Foster et al. (2023b) to achieve sharper concentration, improving their regret bounds from $T^{3/4}$ to $T^{2/3}$ (on-policy) and from $T^{5/6}$ to $T^{7/9}$ (off-policy). For squared error minimization in Bellman-complete MDPs, we redesign their two-timescale procedure, improving the regret bound from $T^{2/3}$ to $\sqrt{T}$. This is the first time a DEC-based method achieves performance matching that of optimism-based approaches (Jin et al., 2021; Xie et al., 2023) in Bellman-complete MDPs.
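To give a feel for the size of the rate improvements quoted in the abstract, the following minimal sketch computes the ratio between the previous and improved bounds at a given horizon $T$. It uses only the exponents stated above and ignores all constants, logarithmic factors, and problem-dependent complexity terms, so the numbers are indicative only. The names `RATES`, `improvement_factor`, and `summarize` are hypothetical helpers introduced for this illustration.

```python
# Exponents quoted in the abstract: (setting, previous exponent, improved exponent).
# Constants, log factors, and complexity terms are deliberately ignored (assumption).
RATES = [
    ("average error, on-policy", 3 / 4, 2 / 3),
    ("average error, off-policy", 5 / 6, 7 / 9),
    ("squared error, Bellman-complete", 2 / 3, 1 / 2),
]


def improvement_factor(T, old_exp, new_exp):
    """Return T**old_exp / T**new_exp, the ratio of the two rate terms."""
    return T ** (old_exp - new_exp)


def summarize(T):
    """Map each setting to its old/new rate ratio at horizon T."""
    return {name: improvement_factor(T, old, new) for name, old, new in RATES}


if __name__ == "__main__":
    for name, factor in summarize(10**6).items():
        print(f"{name}: rate term shrinks by a factor of about {factor:.2f}")
```

The exponent gaps are $1/12$, $1/18$, and $1/6$ respectively, so the squared-error improvement in Bellman-complete MDPs is the largest of the three at any fixed horizon.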

Paper Structure

This paper contains 45 sections, 40 theorems, 162 equations, 1 figure, 3 tables, 4 algorithms.

Key Result

Theorem 3

For the $\Phi$-restricted environment defined in Definition 2, there exists an algorithm ensuring $\mathbb{E}[\mathrm{Reg}(\pi_{\phi^\star})]\leq \mathbb{E}\left[\sum_{t}\min_p \max_\nu \mathsf{AIR}_\eta^\Phi(p,\nu;\rho_t)\right] + \frac{\log|\Phi|}{\eta}$.
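
As a hedged illustration of how a bound of this form is typically turned into a regret rate, suppose (purely as an assumption for this sketch, not a statement from the paper) that the per-round term satisfies $\min_p \max_\nu \mathsf{AIR}_\eta^\Phi(p,\nu;\rho_t) \leq \eta C$ for some constant $C > 0$ over $T$ rounds. Balancing the two terms then gives:

$$
\mathbb{E}[\mathrm{Reg}(\pi_{\phi^\star})] \;\leq\; \eta C T + \frac{\log|\Phi|}{\eta},
\qquad
\eta = \sqrt{\frac{\log|\Phi|}{C T}}
\;\;\Longrightarrow\;\;
\mathbb{E}[\mathrm{Reg}(\pi_{\phi^\star})] \;\leq\; 2\sqrt{C T \log|\Phi|}.
$$

The constant $C$ and the linear-in-$\eta$ form are hypothetical; the actual dependence of the AIR term on $\eta$ is setting-specific and determines which of the rates quoted in the abstract is obtained.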

Figures (1)

  • Figure 1: Partitioning for hybrid MDPs

Theorems & Definitions (81)

  • Definition 1: Infosets and $\Phi$ (Liu et al., 2025; Chen et al., 2025)
  • Definition 2: $\Phi$-restricted environment (Liu et al., 2025; Chen et al., 2025)
  • Theorem 3 (Liu et al., 2025)
  • Definition 4: Stochastic setting
  • Definition 5: Hybrid setting
  • Theorem 6
  • Theorem 7
  • Lemma 8
  • Definition 9: Bellman completeness for the stochastic setting
  • Definition 10: Bellman completeness for the hybrid setting
  • ...and 71 more