Regret Minimization via Saddle Point Optimization

Johannes Kirschner, Seyed Alireza Bakhtiari, Kushagra Chandak, Volodymyr Tkachuk, Csaba Szepesvári

TL;DR

This work derives an anytime variant of the Estimation-To-Decisions (E2D) algorithm that optimizes the exploration-exploitation trade-off online instead of via the analysis, and leads to a practical algorithm for finite model classes and linear feedback models.

Abstract

A long line of works characterizes the sample complexity of regret minimization in sequential decision-making by min-max programs. In the corresponding saddle-point game, the min-player optimizes the sampling distribution against an adversarial max-player that chooses confusing models leading to large regret. The most recent instantiation of this idea is the decision-estimation coefficient (DEC), which was shown to provide nearly tight lower and upper bounds on the worst-case expected regret in structured bandits and reinforcement learning. By re-parametrizing the offset DEC with the confidence radius and solving the corresponding min-max program, we derive an anytime variant of the Estimation-To-Decisions (E2D) algorithm. Importantly, the algorithm optimizes the exploration-exploitation trade-off online instead of via the analysis. Our formulation leads to a practical algorithm for finite model classes and linear feedback models. We further point out connections to the information ratio, decoupling coefficient and PAC-DEC, and numerically evaluate the performance of E2D on simple examples.

Paper Structure

This paper contains 15 sections, 8 theorems, 31 equations, 1 table, 2 algorithms.

Key Result

Theorem 1

Let $\lambda_t \geq 0$ be any sequence adapted to the filtration $\mathcal{F}_t$. Then the regret of Anytime-E2D with input sequence $\lambda_t$ satisfies for all $n \geq 1$: where we defined $\mathrm{dec}^{ac}_{\epsilon,\lambda}(f) = \min_{\mu \in \mathscr{P}(\Pi)} \max_{\nu \in \mathscr{P}(\mathcal{M})} \mu \Delta \nu - \lambda (\mu I_f \nu - \epsilon^2)$.
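
To make the min-max program in this definition concrete, here is a minimal sketch for a finite decision set and a finite model class. It assumes the gaps $\Delta(\pi, g)$ and the divergences $I_f(\pi, g)$ are given as nested lists `gap[pi][g]` and `info[pi][g]`; these names, the function names, and the toy numbers are illustrative assumptions, not taken from the paper. Because the objective is linear in $\nu$, the inner maximum is attained at a single model, so it can be evaluated exactly. The outer minimization is approximated here with the Hedge (multiplicative weights) update against a best-responding max-player, which is a generic technique for matrix games and not necessarily the solver used by the authors.

```python
import math


def dec_objective(mu, gap, info, lam, eps):
    """Evaluate max_nu [mu*Delta*nu - lam*(mu*I_f*nu - eps^2)] for a fixed mu.

    The objective is linear in nu, so the maximum over distributions on the
    finite model class is attained at a point mass on one model.
    """
    n_dec, n_mod = len(gap), len(gap[0])
    worst = max(
        sum(mu[p] * (gap[p][m] - lam * info[p][m]) for p in range(n_dec))
        for m in range(n_mod)
    )
    return worst + lam * eps ** 2


def approx_dec(gap, info, lam, eps, iters=5000):
    """Approximate the min over mu with Hedge; returns (average mu, its value).

    This is an illustrative approximation scheme (an assumption of this
    sketch), not the paper's algorithm.
    """
    n_dec, n_mod = len(gap), len(gap[0])
    # Payoff of the max-player when decision p meets model m.
    payoff = [[gap[p][m] - lam * info[p][m] for m in range(n_mod)]
              for p in range(n_dec)]
    scale = max(abs(x) for row in payoff for x in row) or 1.0
    eta = math.sqrt(8.0 * math.log(n_dec) / iters) / scale
    cum_loss = [0.0] * n_dec
    avg_mu = [0.0] * n_dec
    for _ in range(iters):
        # Hedge weights (shifted by the minimum loss for numerical stability).
        low = min(cum_loss)
        w = [math.exp(-eta * (c - low)) for c in cum_loss]
        total = sum(w)
        mu = [x / total for x in w]
        # Max-player best response: the model that is worst for the current mu.
        m_star = max(
            range(n_mod),
            key=lambda m: sum(mu[p] * payoff[p][m] for p in range(n_dec)),
        )
        for p in range(n_dec):
            cum_loss[p] += payoff[p][m_star]
            avg_mu[p] += mu[p] / iters
    return avg_mu, dec_objective(avg_mu, gap, info, lam, eps)


# Toy instance (hypothetical numbers): two decisions, two models, where
# model 0 plays the role of the estimate f and therefore has zero divergence.
toy_gap = [[0.0, 1.0], [1.0, 0.0]]
toy_info = [[0.0, 0.25], [0.0, 0.25]]
toy_mu, toy_value = approx_dec(toy_gap, toy_info, lam=2.0, eps=0.1)
```

On this toy instance the two model columns are equalized at $\mu = (0.75, 0.25)$, giving an exact value of $0.25 + \lambda \epsilon^2 = 0.27$; the Hedge average approaches this value from above as `iters` grows.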

Theorems & Definitions (11)

  • Example 2.1: Linear Bandits (abe1999associative)
  • Example 2.2: Linear Bandits with Side-Observations
  • Theorem 1
  • Corollary 1
  • Proof of thm:worst-case
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • ...and 1 more