Combinatorial Logistic Bandits

Xutong Liu, Xiangxiang Dai, Xuchuang Wang, Mohammad Hajiesmaili, John C. S. Lui

TL;DR

This work introduces Combinatorial Logistic Bandits (CLogB), a framework for contextual combinatorial multi-armed bandits (CMAB) in which base-arm outcomes are binary, follow a logistic model, and are observed through probabilistic triggering. It develops three algorithms: CLogUCB (variance-agnostic), VA-CLogUCB (variance-adaptive), and EVA-CLogUCB (a computationally efficient variant with a burn-in stage). Regret bounds are derived under the 1-norm TPM and TPVM smoothness conditions, with leading terms of $\tilde{O}(d\sqrt{\kappa KT})$ for CLogUCB under TPM and $\tilde{O}(d\sqrt{T})$ for VA-CLogUCB under TPVM. The methods combine maximum likelihood estimation (MLE) with variance-aware confidence regions and project the MLE to control the exploration bonuses. EVA-CLogUCB adds a burn-in stage that removes the nonconvex projection step while preserving the tight regret bound. Experiments on synthetic cascading bandits, online content delivery, and a real-world probabilistic maximum coverage (PMC) dataset show lower regret than benchmark algorithms at practical running times. The framework targets nonlinear contextual bandits with combinatorial actions and binary feedback, with applications in online content ranking, content delivery network (CDN) optimization, and network resource allocation.

Abstract

We introduce a novel framework called combinatorial logistic bandits (CLogB), where in each round, a subset of base arms (called the super arm) is selected, with the outcome of each base arm being binary and its expectation following a logistic parametric model. The feedback is governed by a general arm triggering process. Our study covers CLogB with reward functions satisfying two smoothness conditions, capturing application scenarios such as online content delivery, online learning to rank, and dynamic channel allocation. We first propose a simple yet efficient algorithm, CLogUCB, utilizing a variance-agnostic exploration bonus. Under the 1-norm triggering probability modulated (TPM) smoothness condition, CLogUCB achieves a regret bound of $\tilde{O}(d\sqrt{\kappa KT})$, where $\tilde{O}$ ignores logarithmic factors, $d$ is the dimension of the feature vector, $\kappa$ represents the nonlinearity of the logistic model, and $K$ is the maximum number of base arms a super arm can trigger. This result improves on prior work by a factor of $\tilde{O}(\sqrt{\kappa})$. We then enhance CLogUCB with a variance-adaptive version, VA-CLogUCB, which attains a regret bound of $\tilde{O}(d\sqrt{KT})$ under the same 1-norm TPM condition, improving another $\tilde{O}(\sqrt{\kappa})$ factor. VA-CLogUCB shows even greater promise under the stronger triggering probability and variance modulated (TPVM) condition, achieving a leading $\tilde{O}(d\sqrt{T})$ regret, thus removing the additional dependency on the action-size $K$. Furthermore, we enhance the computational efficiency of VA-CLogUCB by eliminating the nonconvex optimization process when the context feature map is time-invariant while maintaining the tight $\tilde{O}(d\sqrt{T})$ regret. Finally, experiments on synthetic and real-world datasets demonstrate the superior performance of our algorithms compared to benchmark algorithms.
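To make the role of $\kappa$ concrete, the following is a minimal sketch, not the paper's code. It assumes the definition common in the logistic bandit literature, where $\kappa$ is the largest inverse slope of the sigmoid over the linear predictors $\boldsymbol{\theta}^{\top}\boldsymbol{\phi}(i)$ of the base arms; the abstract above does not state the exact definition. The function names and the toy `theta` and `features` values are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic link: expected binary outcome for linear predictor x."""
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_slope(x: float) -> float:
    """Derivative of the sigmoid, sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)


def kappa(theta: list[float], features: list[list[float]]) -> float:
    """Assumed definition: max over base arms of 1 / sigmoid_slope(theta . phi)."""
    worst = 0.0
    for phi in features:
        x = sum(t * p for t, p in zip(theta, phi))  # linear predictor
        worst = max(worst, 1.0 / sigmoid_slope(x))
    return worst


# Hypothetical toy instance: three base arms with predictors 0, 1, and 3.
theta = [1.0, -2.0]
features = [[0.0, 0.0], [1.0, 0.0], [1.0, -1.0]]
kappa_value = kappa(theta, features)
```

Because $1/\sigma'(x) = 2 + e^{x} + e^{-x}$, the inverse slope equals 4 at $x = 0$ and grows roughly like $e^{|x|}$, so a regret bound carrying a $\sqrt{\kappa}$ factor degrades quickly once some arm has a large predictor. This is the factor the variance-adaptive bound removes.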

Paper Structure

This paper contains 45 sections, 40 equations, 5 figures, 3 tables, 3 algorithms.

Figures (5)

  • Figure 1: Left: illustration of a sigmoid function with linear predictor $x=\boldsymbol{\theta}^{\top} \boldsymbol{\phi}(i)$ as input. The larger $|x|$ is, the flatter the curve and the higher the nonlinearity level $\kappa$, which grows exponentially with $|x|$. Right: CLogB for content delivery networks. The decision maker chooses servers based on contextual features, successfully covers users (green check marks) via edges (solid lines) with probability $p(u,v)$, and gains a reward if the user consumes the content (red play buttons) with probability $p(v)$.
  • Figure 2: (a)-(c) show the results of cascading bandits on the synthetic dataset; (d) shows the results of probabilistic maximum coverage bandits on the real-world dataset. All of our algorithms are represented by solid lines.
  • Figure 3: Regret vs. running time performance under different dimensions.
  • Figure 4: Comprehensive comparison on cumulative regret and running time performance for all algorithms.
  • Figure 5: Regret comparison across various algorithms, including those utilizing upper bounds on $\kappa$.