Towards a Sharp Analysis of Offline Policy Learning for $f$-Divergence-Regularized Contextual Bandits

Qingyue Zhao, Kaixuan Ji, Heyang Zhao, Tong Zhang, Quanquan Gu

TL;DR

For $f$-divergences with strongly convex $f$, the sharp sample complexity $\tilde{\Theta}(\epsilon^{-1})$ is achievable even without pessimistic estimation or single-policy concentrability. For reverse KL, an $\tilde{O}(\epsilon^{-1})$ sample complexity is achieved under single-policy concentrability via a pessimism-based analysis, and a near-matching lower bound shows that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the curvature property of reverse KL.

Abstract

Many offline reinforcement learning algorithms are underpinned by $f$-divergence regularization, but their sample complexity *defined with respect to regularized objectives* still lacks tight analyses, especially in terms of concrete data coverage conditions. In this paper, we study the exact concentrability requirements to achieve the $\tilde{\Theta}(\epsilon^{-1})$ sample complexity for offline $f$-divergence-regularized contextual bandits. For reverse Kullback-Leibler (KL) divergence, arguably the most commonly used one, we achieve an $\tilde{O}(\epsilon^{-1})$ sample complexity under single-policy concentrability for the first time via a novel pessimism-based analysis, surpassing existing $\tilde{O}(\epsilon^{-1})$ bound under all-policy concentrability and $\tilde{O}(\epsilon^{-2})$ bound under single-policy concentrability. We also propose a near-matching lower bound, demonstrating that a multiplicative dependency on single-policy concentrability is necessary to maximally exploit the curvature property of reverse KL. Moreover, for $f$-divergences with strongly convex $f$, to which reverse KL *does not* belong, we show that the sharp sample complexity $\tilde{\Theta}(\epsilon^{-1})$ is achievable even without pessimistic estimation or single-policy concentrability. We further corroborate our theoretical insights with numerical experiments and extend our analysis to contextual dueling bandits. We believe these results take a significant step towards a comprehensive understanding of objectives with $f$-divergence regularization.

Paper Structure

This paper contains 47 sections, 28 theorems, 112 equations, 3 figures, 1 table.

Key Result

Lemma 2.9

For all $\delta > 0$, $\mathcal{E}(\delta)$ holds with probability at least $1-\delta$.

Figures (3)

  • Figure 1: The empirical relation between $\log_2 n$ and $\log_2 \mathrm{SubOpt}$. The fitted rate means the slope of $\log_2 n \sim \log_2 \mathrm{SubOpt}$ estimated via linear regression. Here $n$ is the sample size. Every point is the average over 100 independent trials.
  • Figure 2: The empirical relation between $\log_2 n$ and $\log_2 \mathrm{SubOpt}$ for linear bandits. In the legend, we denote $C^{\pi^*}$ (resp. $D^2_{\pi^*}$) by C (resp. D2).
  • Figure 3: The empirical relation between $\log_2 n$ and $\log_2 \mathrm{SubOpt}$ on MNIST dataset.
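The "fitted rate" in the Figure 1 caption is the slope of a least-squares line through the points $(\log_2 n, \log_2 \mathrm{SubOpt})$. The sketch below shows one way such a slope could be computed. It is only an illustration: the function name `fitted_rate` and the synthetic curves are assumptions made here, not the paper's code or data. A sample complexity of order $\epsilon^{-1}$ corresponds to suboptimality decaying like $1/n$ (slope near $-1$), while $\epsilon^{-2}$ corresponds to $1/\sqrt{n}$ (slope near $-1/2$).

```python
import math


def fitted_rate(ns, subopts):
    """Least-squares slope of log2(SubOpt) regressed on log2(n).

    Hypothetical helper for illustration; not taken from the paper.
    """
    xs = [math.log2(n) for n in ns]
    ys = [math.log2(s) for s in subopts]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    # Ordinary least squares: slope = cov(x, y) / var(x).
    cov_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    var_x = sum((x - x_mean) ** 2 for x in xs)
    return cov_xy / var_x


# Synthetic, noise-free curves (assumed for illustration, not experimental data).
ns = [2 ** k for k in range(6, 14)]
subopt_fast = [5.0 / n for n in ns]             # decays like 1/n
subopt_slow = [5.0 / math.sqrt(n) for n in ns]  # decays like 1/sqrt(n)

rate_fast = fitted_rate(ns, subopt_fast)
rate_slow = fitted_rate(ns, subopt_slow)
print(f"fast rate: {rate_fast:.3f}, slow rate: {rate_slow:.3f}")
```

With real experimental data, each suboptimality value would first be averaged over independent trials (100 per point in Figure 1) before taking logarithms and fitting.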

Theorems & Definitions (39)

  • Definition 2.2: $\epsilon$-net and covering number
  • Definition 2.4: Density-ratio-based concentrability
  • Definition 2.5
  • Remark 2.6
  • Lemma 2.9
  • Theorem 2.10
  • Theorem 2.11
  • Remark 2.12
  • Remark 2.13
  • Lemma 2.14
  • ...and 29 more