Regret Minimization via Saddle Point Optimization

Johannes Kirschner; Seyed Alireza Bakhtiari; Kushagra Chandak; Volodymyr Tkachuk; Csaba Szepesvári

TL;DR

This work derives an anytime variant of the Estimation-To-Decisions (E2D) algorithm that optimizes the exploration-exploitation trade-off online instead of via the analysis, and leads to a practical algorithm for finite model classes and linear feedback models.

Abstract

A long line of works characterizes the sample complexity of regret minimization in sequential decision-making by min-max programs. In the corresponding saddle-point game, the min-player optimizes the sampling distribution against an adversarial max-player that chooses confusing models leading to large regret. The most recent instantiation of this idea is the decision-estimation coefficient (DEC), which was shown to provide nearly tight lower and upper bounds on the worst-case expected regret in structured bandits and reinforcement learning. By re-parametrizing the offset DEC with the confidence radius and solving the corresponding min-max program, we derive an anytime variant of the Estimation-To-Decisions (E2D) algorithm. Importantly, the algorithm optimizes the exploration-exploitation trade-off online instead of via the analysis. Our formulation leads to a practical algorithm for finite model classes and linear feedback models. We further point out connections to the information ratio, decoupling coefficient and PAC-DEC, and numerically evaluate the performance of E2D on simple examples.
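The re-parametrization of the offset DEC by a confidence radius can be made explicit in the notation of Theorem 1. As a sketch, write $\text{dec}^{o}_{\lambda}(f)$ (our shorthand, assumed here to denote the offset form) for the min-max program of Theorem 1 without the $\epsilon^2$ term. The constant $\lambda\epsilon^2$ depends on neither $\mu$ nor $\nu$, so it passes through the min and the max:

$$\text{dec}^{o}_{\lambda}(f) = \min_{\mu \in \mathscr{P}(\Pi)} \max_{\nu \in \mathscr{P}(\mathcal{M})} \mu \Delta \nu - \lambda\, \mu I_f \nu, \qquad \text{dec}^{ac}_{\epsilon,\lambda}(f) = \text{dec}^{o}_{\lambda}(f) + \lambda \epsilon^2.$$

For any fixed $\mu$ and any $\lambda \geq 0$, every $\nu$ with $\mu I_f \nu \leq \epsilon^2$ makes the penalty $-\lambda(\mu I_f \nu - \epsilon^2)$ nonnegative, so maximizing the penalized objective over all $\nu$ can only exceed maximizing $\mu \Delta \nu$ over this constrained set. Taking the minimum over $\mu$ and then the infimum over $\lambda$ gives

$$\min_{\mu \in \mathscr{P}(\Pi)} \max_{\nu \in \mathscr{P}(\mathcal{M}) :\, \mu I_f \nu \leq \epsilon^2} \mu \Delta \nu \;\leq\; \inf_{\lambda \geq 0} \text{dec}^{ac}_{\epsilon,\lambda}(f).$$

One way to read this: $\epsilon$ bounds how much information the confusing models may reveal under the sampling distribution $\mu$, and $\lambda$ is the multiplier attached to that constraint.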

Paper Structure (15 sections, 8 theorems, 31 equations, 1 table, 2 algorithms)

Introduction
Contributions
Related Work
Setting
Regret Minimization via Saddle-Point Optimization
The Decision-Estimation Coefficient
Anytime Estimation-To-Decisions (Anytime-E2D)
Certifying Upper Bounds
Upper Bounds via Decoupling
PAC to Regret
Application to Linear Feedback Models
Computational Aspects
Conclusion
Online Density Estimation
Bounding the Estimation Error of Projected Regularized Least-Squares

Key Result

Theorem 1

Let $\lambda_t \geq 0$ be any sequence adapted to the filtration $\mathcal{F}_t$. Then the regret of Anytime-E2D with input sequence $\lambda_t$ satisfies for all $n \geq 1$: where we defined $\text{dec}^{ac}_{\epsilon,\lambda}(f) = \min_{\mu \in \mathscr{P}(\Pi)} \max_{\nu \in \mathscr{P}(\mathcal{M})} \mu \Delta \nu - \lambda (\mu I_f \nu - \epsilon^2)$.
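For a finite decision set $\Pi$ and a finite model class $\mathcal{M}$, the objective defining $\text{dec}^{ac}_{\epsilon,\lambda}(f)$ is bilinear in $(\mu, \nu)$, so the program is a zero-sum matrix game with payoff matrix $\Delta - \lambda I_f$, shifted by the constant $\lambda\epsilon^2$. The sketch below approximates its value with Hedge (multiplicative-weights) self-play. This is a generic matrix-game solver chosen for illustration, not necessarily the solver used in the paper; an exact linear program would also work. It assumes $\Delta$ and $I_f$ are given as $|\Pi| \times |\mathcal{M}|$ arrays (rows are decisions, columns are models), with $I_f$ already evaluated at the reference model $f$. The function name `dec_ac` and its arguments are hypothetical.

```python
import numpy as np


def dec_ac(delta, info, lam, eps, iters=20000):
    """Approximate dec^{ac}_{eps,lam}(f) for finite decision and model sets.

    Hypothetical helper. delta[i, j] is the regret of decision i under model j
    (the matrix Delta); info[i, j] is the information term I_f for decision i
    and model j, already evaluated at the reference model f.
    Returns (value, mu_bar): the value guaranteed by the averaged sampling
    distribution mu_bar (an upper bound on the min-max value) and mu_bar.
    """
    delta = np.asarray(delta, dtype=float)
    info = np.asarray(info, dtype=float)
    # Bilinear objective: mu @ (delta - lam * info) @ nu + lam * eps**2,
    # i.e. a zero-sum matrix game plus a constant.
    A = delta - lam * info
    n_dec, n_mod = A.shape
    # Payoff range, used to normalise the Hedge step sizes.
    scale = max(A.max() - A.min(), 1e-12)
    eta_mu = np.sqrt(8 * np.log(max(n_dec, 2)) / iters) / scale
    eta_nu = np.sqrt(8 * np.log(max(n_mod, 2)) / iters) / scale
    w_mu = np.zeros(n_dec)  # log-weights of the min-player (decisions)
    w_nu = np.zeros(n_mod)  # log-weights of the max-player (models)
    mu_sum = np.zeros(n_dec)
    for _ in range(iters):
        mu = np.exp(w_mu - w_mu.max())
        mu /= mu.sum()
        nu = np.exp(w_nu - w_nu.max())
        nu /= nu.sum()
        mu_sum += mu
        # Hedge updates: the min-player down-weights rows with high expected
        # payoff, the max-player up-weights columns with high expected payoff.
        w_mu -= eta_mu * (A @ nu)
        w_nu += eta_nu * (A.T @ mu)
    mu_bar = mu_sum / iters
    # A best response of the max-player to mu_bar is a single model (column),
    # so this is the value that sampling from mu_bar guarantees.
    value = float(np.max(mu_bar @ A)) + lam * eps**2
    return value, mu_bar
```

For example, with two decisions that each incur regret 1 under exactly one of two models and an all-zero information matrix, the min-player can do no better than mixing uniformly, so the returned value is $0.5 + \lambda\epsilon^2$. Returning the best-response value of the averaged $\mu$ (rather than the average payoff along the run) means the reported number is always achievable by the distribution that is actually returned.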

Theorems & Definitions (11)

Example 2.1: Linear Bandits (Abe and Long, 1999)
Example 2.2: Linear Bandits with Side-Observations
Theorem 1
Corollary 1
Proof of the worst-case theorem
Lemma 1
Lemma 2
Lemma 3
Lemma 4
Lemma 5