Model Selection for Average Reward RL with Application to Utility Maximization in Repeated Games

Alireza Masoumian, James R. Wright

TL;DR

This work proposes MRBEAR, an online model selection algorithm for the average reward RL setting, and applies it to the interaction between a learner and an opponent in a two-player simultaneous general-sum repeated game, where the opponent follows a fixed, unknown limited-memory strategy.

Abstract

In standard RL, a learner attempts to learn an optimal policy for a Markov Decision Process (MDP) whose structure (e.g. state space) is known. In online model selection, a learner attempts to learn an optimal policy for an MDP knowing only that it belongs to one of $M > 1$ model classes of varying complexity. Recent results have shown that this can be feasibly accomplished in episodic online RL. In this work, we propose $\mathsf{MRBEAR}$, an online model selection algorithm for the average reward RL setting. The regret of the algorithm is in $\tilde O(M C_{m^*}^2 \mathsf{B}_{m^*}(T,\delta))$, where $C_{m^*}$ represents the complexity of the simplest well-specified model class and $\mathsf{B}_{m^*}(T,\delta)$ is its corresponding regret bound. This result shows that in average reward RL, as in episodic online RL, the additional cost of model selection scales only linearly in $M$, the number of model classes. We apply $\mathsf{MRBEAR}$ to the interaction between a learner and an opponent in a two-player simultaneous general-sum repeated game, where the opponent follows a fixed, unknown limited-memory strategy. The learner's goal is to maximize its utility without knowing the opponent's utility function. The interaction is over $T$ rounds with no episodes or discounting, which leads us to measure the learner's performance by average reward regret. In this application, our algorithm enjoys an opponent-complexity-dependent regret in $\tilde O(M(\mathsf{sp}(h^*) B^{m^*} A^{m^*+1})^{\frac{3}{2}} \sqrt{T})$, where $m^*\le M$ is the unknown memory limit of the opponent, $\mathsf{sp}(h^*)$ is the unknown span of optimal bias induced by the opponent, and $A$ and $B$ are the number of actions for the learner and opponent respectively. We also show that the exponential dependency on $m^*$ is inevitable by proving a lower bound on the learner's regret.
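To make the repeated-game application concrete, the following is a minimal sketch of how a fixed limited-memory opponent turns the game into an MDP for the learner. It is not the MRBEAR algorithm. It assumes the state is the tuple of the last $m$ joint actions, which is consistent with the $B^{m^*} A^{m^*+1}$ factor in the bound (that many state-action pairs) but is not spelled out in the abstract. The function names, the tit-for-tat opponent, and the prisoner's-dilemma payoffs are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence, Tuple

JointAction = Tuple[int, int]        # (learner action, opponent action)
History = Tuple[JointAction, ...]    # last m joint actions: the assumed MDP state


def num_state_action_pairs(A: int, B: int, m: int) -> int:
    """Count state-action pairs when states are the last m joint actions.

    There are (A*B)**m states and A learner actions per state,
    giving A**(m+1) * B**m pairs in total.
    """
    return (A * B) ** m * A


def average_reward(
    policy: Callable[[History], int],
    opponent: Callable[[History], int],
    payoff: Sequence[Sequence[float]],
    init: History,
    m: int,
    T: int,
) -> float:
    """Play T rounds with no episodes or discounting; return the mean payoff."""
    history = init
    total = 0.0
    for _ in range(T):
        # Simultaneous moves: both players see only the shared history.
        a = policy(history)
        b = opponent(history)
        total += payoff[a][b]
        # Keep only the last m joint actions (empty state when m == 0).
        history = (history + ((a, b),))[-m:] if m > 0 else ()
    return total / T


# Hypothetical learner payoffs for a prisoner's dilemma: 0 = cooperate, 1 = defect.
PAYOFF = [[3.0, 0.0],
          [5.0, 1.0]]


def tit_for_tat(history: History) -> int:
    """Memory-1 opponent: repeat the learner's previous action."""
    return history[-1][0]


start: History = ((0, 0),)  # assume both players cooperated in the previous round
coop_avg = average_reward(lambda h: 0, tit_for_tat, PAYOFF, start, m=1, T=1000)
defect_avg = average_reward(lambda h: 1, tit_for_tat, PAYOFF, start, m=1, T=1000)
pairs = num_state_action_pairs(A=2, B=2, m=1)
```

In this toy instance, always cooperating earns a higher long-run average than always defecting, even though defecting is the better one-shot response. This is why the learner's performance is measured against the optimal policy of the induced MDP rather than against a myopic best response.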

Paper Structure

This paper contains 31 sections, 28 theorems, 105 equations, 1 algorithm.

Key Result

Proposition 2.3

(boone2024achieving, Theorem 5) Let $c_{h^*}>0$. Assume that $\textsf{PMEVI-DT}$ runs with proper confidence regions $\boldsymbol{\mathcal{M}}_t$. If $T \geq c_{h^*}^5$, then for every weakly communicating model with $\mathsf{sp}(h^*) \leq c_{h^*}$, $\textsf{PMEVI-DT}$ achieves the following regret, where $c_0$ is a universal constant.

Theorems & Definitions (49)

  • Definition 2.1
  • Definition 2.2
  • Proposition 2.3
  • Definition 2.4
  • Theorem 4.1: Main theorem
  • Lemma 4.2
  • Lemma 4.3
  • Lemma 4.4
  • Definition 5.1
  • Remark 5.4
  • ...and 39 more