On Bits and Bandits: Quantifying the Regret-Information Trade-off

Itai Shufaro; Nadav Merlis; Nir Weinberger; Shie Mannor

TL;DR

This work formalizes the regret-information trade-off in contextual, Bayesian online decision-making. It introduces a general Fano-based method to derive worst-case Bayesian regret lower bounds and develops information-theoretic upper and lower bounds that tie regret to the amount of information an agent accumulates, via mutual information and entropy constraints. The authors provide both finite- and infinite-decision-space results, including bounds for contextual MAB and linear bandits, and demonstrate the practical utility of the framework with experiments on Bayesian bandits and LLM-assisted question answering. Overall, the paper offers principled tools to quantify the value of external information and to design information-aware strategies for reducing regret in sequential decision tasks.

Abstract

In many sequential decision problems, an agent performs a repeated task. It then suffers regret and obtains information that it may use in the following rounds. However, sometimes the agent may also obtain information and avoid suffering regret by querying external sources. We study the trade-off between the information an agent accumulates and the regret it suffers. We invoke information-theoretic methods for obtaining regret lower bounds, which also allow us to easily re-derive several known lower bounds. We introduce the first Bayesian regret lower bounds that depend on the information an agent accumulates. We also prove regret upper bounds using the amount of information the agent accumulates. These bounds show that information, measured in bits, can be traded off for regret, measured in reward. Finally, we demonstrate the utility of these bounds in improving the performance of a question-answering task with large language models, allowing us to obtain valuable insights.

Paper Structure (35 sections, 21 theorems, 60 equations, 4 figures, 5 tables)

This paper contains 35 sections, 21 theorems, 60 equations, 4 figures, 5 tables.

Introduction
Contributions
Setting and Preliminaries
Regret Lower Bounds Using Fano's Inequality
Information Theoretic Bayesian Regret Upper and Lower Bounds
Mutual Information Constraint
Entropy Constraint
Experiments
Stochastic Bayesian Bandit
Question Answering with Large Language Models
Related Work
Conclusions and Future Work
Table of Notations
Useful Properties of Covering Numbers
Proofs for Section 3 (Regret Lower Bounds Using Fano's Inequality)
...and 20 more sections

Key Result

Theorem 3.1

Let $X,Y \sim Q$ be two jointly distributed random variables, where $X$ takes values in a finite set $\mathcal{X}$ of cardinality $|\mathcal{X}|$. Let $\hat{X} = f(Y)$, for some function $f$, be an estimator of $X$. If $X$ is uniformly distributed over all possible values in $\mathcal{X}$, then the following
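
For reference, the textbook form of Fano's inequality for a uniformly distributed $X$ reads $P(\hat{X} \neq X) \ge 1 - (I(X;Y) + \log 2)/\log|\mathcal{X}|$. The sketch below evaluates that textbook bound numerically, assuming the mutual information is given in bits (so that $\log 2$ equals one bit). The function name and the clamping at zero are illustrative choices, not taken from the paper, and the paper's exact statement may be presented differently.

```python
import math


def fano_error_lower_bound(mutual_information_bits: float, num_values: int) -> float:
    """Textbook Fano lower bound on P(X_hat != X) for X uniform over num_values outcomes.

    Assumption: information is measured in bits, so log 2 contributes exactly 1 bit.
    """
    if num_values < 2:
        raise ValueError("num_values must be at least 2")
    if mutual_information_bits < 0:
        raise ValueError("mutual information cannot be negative")
    bound = 1.0 - (mutual_information_bits + 1.0) / math.log2(num_values)
    # A probability lower bound below zero carries no information, so clamp it.
    return max(0.0, bound)


# With 1024 equally likely hypotheses and 4 bits of information about X,
# any estimator errs with probability at least 0.5.
print(fano_error_lower_bound(4.0, 1024))  # 0.5
```

The bound makes the bits-for-regret intuition concrete: each additional bit of information about $X$ lowers the guaranteed error probability by $1/\log_2|\mathcal{X}|$.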

Figures (4)

Figure 1: Schematic of a general interactive decision-making task with contextual information. Every round, a source of knowledge provides the agent with information. The agent then makes a decision that causes the task to generate an observation and a reward. The observation is revealed to the agent, which updates its next decision according to the past observations and the information it received.
Figure 2: (a) The Bayesian regret and (b) the accumulated information in bits for three different bandit algorithms, under the same bandit structure and a uniform prior. The bandit algorithms used are described in the legend. The shaded areas correspond to 2-sigma error bars.
Figure 3: (a) The Bayesian regret and (b) the accumulated information in bits for Thompson sampling under three different priors over the same bandit structure. The entropy of each prior is described in the legend. The shaded areas correspond to 2-sigma error bars.
Figure 4: The prompting done for the LLMs in our experiments. The question is preceded by a question prompt followed by "Answer the following question:". Following this, the options are presented. <Question prompt> and <End prompt> are both replaced by a different prompt for every model.
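
To give a sense of the kind of quantity plotted in Figures 2 and 3, the following is a minimal sketch of how Bayesian regret for Thompson sampling could be estimated by simulation. It is not the authors' experimental code. It assumes Bernoulli rewards and an independent Beta(1, 1) (uniform) prior on each arm's mean, which may differ from the bandit structure and priors used in the paper. The function names `thompson_sampling_regret` and `bayesian_regret` are hypothetical.

```python
import random


def thompson_sampling_regret(num_arms: int, horizon: int, rng: random.Random) -> float:
    """Run one Bernoulli bandit drawn from the prior and return its cumulative regret."""
    # Assumption: each arm mean is drawn independently from Uniform(0, 1) = Beta(1, 1).
    means = [rng.random() for _ in range(num_arms)]
    best_mean = max(means)
    alpha = [1] * num_arms  # Beta posterior parameter: 1 + observed successes
    beta = [1] * num_arms   # Beta posterior parameter: 1 + observed failures
    regret = 0.0
    for _ in range(horizon):
        # Sample a plausible mean for every arm from its posterior, then act greedily.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(num_arms)]
        arm = max(range(num_arms), key=samples.__getitem__)
        reward = 1 if rng.random() < means[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        regret += best_mean - means[arm]
    return regret


def bayesian_regret(num_arms: int, horizon: int, num_runs: int, seed: int = 0) -> float:
    """Average the cumulative regret over bandit instances drawn from the prior."""
    rng = random.Random(seed)
    total = sum(thompson_sampling_regret(num_arms, horizon, rng) for _ in range(num_runs))
    return total / num_runs


print(bayesian_regret(num_arms=3, horizon=200, num_runs=20, seed=1))
```

Averaging over instances drawn from the prior is what makes the estimate a Bayesian regret rather than the regret on one fixed bandit; repeating the run for several horizons would trace out a curve like those in panel (a).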

Theorems & Definitions (36)

Example 2.1: Contextual MAB with Bernoulli rewards
Example 2.2: Tabular reinforcement learning with a finite horizon
Theorem 3.1: Fano's inequality, Theorem 2.10 of cover1999elements
Proposition 3.2
Theorem 3.3: yang1999information
Theorem 3.4
Proposition 4.1
Proposition 4.2
Proposition 4.3
Theorem 4.4: Propositions 2 and 4 of russo2014informationdirected
...and 26 more
