The Fallacy of Minimizing Cumulative Regret in the Sequential Task Setting

Ziping Xu; Kelly W. Zhang; Susan A. Murphy

TL;DR

Task non-stationarity leads to a more restrictive trade-off between cumulative regret (CR) and simple regret (SR): more exploration is needed in non-stationary environments to accommodate task changes, which affects the design of RL algorithms in healthcare and other fields.

Abstract

Online Reinforcement Learning (RL) is typically framed as the process of minimizing cumulative regret (CR) through interactions with an unknown environment. However, real-world RL applications usually involve a sequence of tasks, and the data collected in the first task is used to warm-start the second task. The performance of the warm-start policy is measured by simple regret (SR). While minimizing CR and minimizing SR are generally conflicting objectives, previous research has shown that in stationary environments both can be optimized in terms of the task duration $T$. In real-world applications, however, human-in-the-loop decisions between tasks often result in non-stationarity. For instance, in clinical trials, scientists may adjust target health outcomes between implementations. Our results show that task non-stationarity leads to a more restrictive trade-off between CR and SR. To balance these competing goals, the algorithm must explore excessively, leading to a CR bound worse than the typical optimal rate of $T^{1/2}$. These findings are practically significant: they indicate that increased exploration is necessary in non-stationary environments to accommodate task changes, which affects the design of RL algorithms in fields such as healthcare and beyond.
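To make the two metrics concrete, the following is a minimal sketch, not the paper's setting or algorithm. It assumes a stationary Bernoulli multi-armed bandit (no contexts, no change between tasks) and an epsilon-greedy learner; the function name `run_bandit` and its parameters are hypothetical. CR sums the per-round gap to the best arm during the first task, while SR is the gap of the single arm that would be handed to the next task as a warm start. Both are computed from the true means (pseudo-regret), which is an illustrative simplification.

```python
import random


def run_bandit(means, horizon, explore_prob, seed=0):
    """Epsilon-greedy on a Bernoulli bandit (illustrative assumption).

    Returns (cumulative_regret, simple_regret), both measured against
    the true arm means rather than realized rewards.
    """
    rng = random.Random(seed)
    n_arms = len(means)
    best = max(means)
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    cumulative_regret = 0.0

    for t in range(horizon):
        if t < n_arms:
            arm = t  # pull every arm once so empirical means are defined
        elif rng.random() < explore_prob:
            arm = rng.randrange(n_arms)  # explore uniformly at random
        else:
            # exploit the arm with the highest empirical mean
            arm = max(range(n_arms), key=lambda a: sums[a] / counts[a])
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        cumulative_regret += best - means[arm]  # gap paid in this round

    # Warm-start policy for the next task: commit to the empirically best arm.
    recommended = max(range(n_arms), key=lambda a: sums[a] / counts[a])
    simple_regret = best - means[recommended]
    return cumulative_regret, simple_regret


if __name__ == "__main__":
    for eps in (0.0, 0.1, 0.5):
        cr, sr = run_bandit([0.5, 0.45, 0.4], horizon=2000, explore_prob=eps)
        print(f"explore_prob={eps}: CR={cr:.1f}, SR={sr:.2f}")
```

Varying `explore_prob` shows the tension the abstract describes: more exploration raises CR in the first task, while it tends to make the recommended arm more reliable and so lowers SR. The sketch does not model the task changes that drive the paper's lower bound.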

Paper Structure (29 sections, 7 theorems, 45 equations, 3 figures, 4 tables)

Introduction
Human-in-the-loop between tasks.
Main contribution.
Problem Setup
Notations.
Two-task contextual bandit learning paradigm.
Motivations for changes between tasks.
Minimax Lower Bound
Problem instance.
Discussion.
Lower bound construction
Case of $\Pi^{(1)} \neq \Pi^{(2)}$: add a new feature.
Case of $f^{(1)} \neq f^{(2)}$: change reward mapping.
Optimal Level of Exploration
Study on Changes in $P$
...and 14 more sections

Key Result

Theorem 1

For the policy spaces $\Pi^{(1)} = \{\pi: \pi(\cdot \mid x_1) = \pi(\cdot \mid x_2) \text{ for all } x_1, x_2 \in \mathcal{X}\}$ and $\Pi^{(2)} = \Pi$, there exist an instance set $\mathcal{P}$ and reward mappings $f^{(1)} = f^{(2)}$ such that [...]. Similarly, for some $f^{(1)} \neq f^{(2)}$, there exist an instance set $\mathcal{P}$ and policy spaces $\Pi^{(1)} = \Pi^{(2)} = \Pi$ such that [...].

Figures (3)

Figure 1: Two-task learning paradigm
Figure 2: Experiment with changes in policy spaces
Figure 3: Experiment with changes in reward mappings

Theorems & Definitions (17)

Definition 1: Cumulative regret
Definition 2: Simple regret
Remark 1
Definition 3: Occupancy measure
Definition 4: Learning algorithm
Theorem 1
Proposition 1
Theorem 2
Proposition 2
Theorem 3
...and 7 more
