Efficient Algorithms for Logistic Contextual Slate Bandits with Bandit Feedback

Tanmay Goyal, Gaurav Sinha

TL;DR

This work proposes two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that achieve $N^{O(1)}$ per-round time complexity via local planning (independent slot selections) and low regret through global learning (joint parameter estimation), and provides theoretical and empirical evidence supporting these claims.

Abstract

We study the Logistic Contextual Slate Bandit problem, where, at each round, an agent selects a slate of $N$ items from an exponentially large set (of size $2^{\Omega(N)}$) of candidate slates provided by the environment. A single binary reward, determined by a logistic model, is observed for the chosen slate. Our objective is to develop algorithms that maximize cumulative reward over $T$ rounds while maintaining low per-round computational costs. We propose two algorithms, Slate-GLM-OFU and Slate-GLM-TS, that accomplish this goal. These algorithms achieve $N^{O(1)}$ per-round time complexity via local planning (independent slot selections), and low regret through global learning (joint parameter estimation). We provide theoretical and empirical evidence supporting these claims. Under a well-studied diversity assumption, we prove that Slate-GLM-OFU incurs only $\tilde{O}(\sqrt{T})$ regret. Extensive experiments across a wide range of synthetic settings demonstrate that our algorithms consistently outperform state-of-the-art baselines, achieving both the lowest regret and the fastest runtime. Furthermore, we apply our algorithm to select in-context examples in prompts of Language Models for solving binary classification tasks such as sentiment analysis. Our approach achieves competitive test accuracy, making it a viable alternative in practical scenarios.
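
To make the "local planning" idea concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that the slate's linear score is the sum of per-slot scores (for example, a slate feature vector formed by concatenating one item vector per slot), that the parameter is already known, and that no exploration bonus is used; the paper's methods additionally estimate the parameter jointly and handle exploration. All names (`make_instance`, `best_slate_local`, and so on) are hypothetical. Under these assumptions, because the logistic function is monotone, choosing the best item for each slot independently reaches the same reward probability as enumerating every slate.

```python
import itertools
import math
import random


def sigmoid(z):
    """Logistic link: maps a linear score to a reward probability."""
    return 1.0 / (1.0 + math.exp(-z))


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def make_instance(n_slots, n_items, dim, seed=0):
    """Hypothetical toy instance: one parameter block and one candidate list per slot."""
    rng = random.Random(seed)
    theta = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(n_slots)]
    candidates = [
        [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(n_items)]
        for _ in range(n_slots)
    ]
    return theta, candidates


def slate_probability(theta, candidates, choice):
    """Reward probability of a slate, given as one item index per slot."""
    # Assumption: the slate score is additive across slots.
    score = sum(dot(theta[i], candidates[i][k]) for i, k in enumerate(choice))
    return sigmoid(score)


def best_slate_bruteforce(theta, candidates):
    """Global planning: evaluate every one of the n_items ** n_slots slates."""
    all_choices = itertools.product(*(range(len(items)) for items in candidates))
    return max(all_choices, key=lambda c: slate_probability(theta, candidates, c))


def best_slate_local(theta, candidates):
    """Local planning: pick the best item for each slot independently."""
    return tuple(
        max(range(len(items)), key=lambda k: dot(theta_i, items[k]))
        for theta_i, items in zip(theta, candidates)
    )


theta, candidates = make_instance(n_slots=4, n_items=5, dim=3)
global_choice = best_slate_bruteforce(theta, candidates)  # 5 ** 4 = 625 slate evaluations
local_choice = best_slate_local(theta, candidates)        # 4 * 5 = 20 item evaluations
print(slate_probability(theta, candidates, global_choice))
print(slate_probability(theta, candidates, local_choice))
```

In this toy setting the two printed probabilities coincide, while the number of evaluations drops from exponential in the number of slots to linear in it, which is the kind of saving the abstract attributes to independent slot selections.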

Paper Structure

This paper contains 21 sections, 30 theorems, 156 equations, 3 figures, 2 tables, 4 algorithms.

Key Result

Theorem 3.1

Let $\mathcal{T} = \{s\in [T]: (\mathbf{x}_s, y_s)\in \mathcal{H}_T\}$, i.e., the set of rounds up to round $T$ where the inequality condition in equation:adaptivity-criterion fails. Also, let $\mathbf{x}_{\star, t} = \mathop{\mathrm{arg\,max}}\limits_{\mathbf{x}\in \mathcal{X}_t} \mu(\mathbf{x}^\t

Figures (3)

  • Figure 1:
  • Figure 2:
  • Figure 3: Demonstration of the algorithm-dependent assumption for Slate-GLM-OFU and Slate-GLM-TS: the minimum eigenvalues of $\bm{W}^i_t$ are plotted as a function of the round for 100 independent runs.

Theorems & Definitions (56)

  • Theorem 3.1: Regret of Slate-GLM-OFU
  • Claim A.1
  • Definition A.1
  • Theorem B.1: Regret of Slate-GLM-OFU
  • proof
  • Lemma B.1
  • proof
  • Proposition B.1
  • Proposition B.2
  • Lemma B.2
  • ...and 46 more