Merit-based Fair Combinatorial Semi-Bandit with Unrestricted Feedback Delays

Ziqun Chen; Kechao Cai; Zhuoyue Chen; Jinbei Zhang; John C. S. Lui

TL;DR

This work studies stochastic combinatorial multi-armed bandits with unrestricted feedback delays under merit-based fairness constraints, motivated by crowdsourcing and online advertising. It introduces two delay models—reward-independent and reward-dependent—and proves that the optimal fair policy induces arm-selection probabilities proportional to a merit function: $p_a^* = \dfrac{L f(\mu_a)}{\sum_{a'} f(\mu_{a'})}$. The authors design fair CMAB algorithms (FCUCB-D, FCTS-D, OP-FCUCB-D, OP-FCTS-D) tailored to each delay setting and establish sublinear expected reward and fairness regrets, with regret bounds depending on delay quantiles $d^*(q)$. Across extensive synthetic and real-data experiments, the methods achieve merit-based fairness while maintaining favorable reward performance, even under heavy-tailed or infinite-delay regimes. The results demonstrate that fairness-aware, delay-tolerant bandits can effectively balance exploration/exploitation with equitable arm treatment in practical applications.

Abstract

We study the stochastic combinatorial semi-bandit problem with unrestricted feedback delays under merit-based fairness constraints. This is motivated by applications such as crowdsourcing and online advertising, where feedback is not immediately available and fairness among different choices (or arms) is crucial. We consider two types of unrestricted feedback delays: reward-independent delays, where the feedback delays are independent of the rewards, and reward-dependent delays, where the feedback delays are correlated with the rewards. Furthermore, we introduce merit-based fairness constraints to ensure a fair selection of the arms. We define the reward regret and the fairness regret and present new bandit algorithms that select arms based on their merits under unrestricted feedback delays. We prove that all of our algorithms achieve sublinear expected reward regret and expected fairness regret, with a dependence on the quantiles of the delay distribution. We also conduct extensive experiments using synthetic and real-world data and show that our algorithms can fairly select arms with different feedback delays.
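The distinction between the two delay types can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the paper's experimental setup: a single Bernoulli arm, geometric delays, and the hypothetical names `geometric_delay`, `observed_feedback`, `fast_prob`, and `slow_prob` are all introduced here for illustration. In the reward-dependent branch, a reward of 1 is assumed to arrive more slowly than a reward of 0.

```python
import random
from typing import List


def geometric_delay(rng: random.Random, success_prob: float) -> int:
    """Sample a delay in rounds (0, 1, 2, ...) from a geometric distribution."""
    delay = 0
    while rng.random() >= success_prob:
        delay += 1
    return delay


def observed_feedback(mu: float, horizon: int, reward_dependent: bool,
                      seed: int = 0, fast_prob: float = 0.5,
                      slow_prob: float = 0.05) -> List[int]:
    """Pull one Bernoulli(mu) arm every round; return the rewards whose
    feedback has arrived by the end of round `horizon`.

    Assumption (illustrative only): with reward-dependent delays a reward
    of 1 uses `slow_prob` and a reward of 0 uses `fast_prob`; with
    reward-independent delays every pull uses `fast_prob`.
    """
    rng = random.Random(seed)
    arrived = []
    for t in range(1, horizon + 1):
        reward = 1 if rng.random() < mu else 0
        if reward_dependent and reward == 1:
            delay = geometric_delay(rng, slow_prob)
        else:
            delay = geometric_delay(rng, fast_prob)
        if t + delay <= horizon:  # feedback is visible only after the delay
            arrived.append(reward)
    return arrived


independent = observed_feedback(mu=0.6, horizon=50, reward_dependent=False)
dependent = observed_feedback(mu=0.6, horizon=50, reward_dependent=True)
print(len(independent), len(dependent))
```

Comparing the empirical mean of `dependent` with that of `independent` over short horizons shows how the set of arrived samples can over-represent fast-arriving outcomes when delays depend on rewards.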

Paper Structure (19 sections, 16 theorems, 110 equations, 9 figures, 4 algorithms)

Related work
Fair CMAB with General Feedback Delays
Algorithms for Reward-independent Delays
FCUCB-D Algorithm
FCTS-D Algorithm
Algorithms for Reward-dependent Delays
OP-FCUCB-D Algorithm
OP-FCTS-D Algorithm
Experiments
Conclusion & Future Work
Proofs of the Theorems
Proof of Theorem \ref{the:Optimal-Fair-Policy}
Proof of Theorem \ref{the:Lower-bound-fr-without-Assumption}
Proof of Theorem \ref{the:fairness-reward-regret-ucb-type}
Proof of Theorem \ref{the:fairness-reward-regret-thompson-sam-type}
...and 4 more sections

Key Result

Theorem 3

For any $\mu_a$, $a\in[K]$, and any choice of merit function $f(\cdot) > 0$, there exists a unique optimal fair policy $\bm{p}^*=\left\lbrace p^*_1,p^*_2,\ldots,p^*_K \right\rbrace$ that satisfies the merit-based fairness constraints.
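As a numerical illustration, the proportional form given in the TL;DR, $p_a^* = L f(\mu_a) / \sum_{a'} f(\mu_{a'})$, can be evaluated directly. The sketch below is a minimal example under assumptions: $L$ is taken to be the number of arms selected per round, the function name `optimal_fair_policy` and the merit function $f(x) = 1 + x$ are hypothetical choices, and the feasibility check $p_a^* \le 1$ is added here rather than taken from the paper.

```python
from typing import Callable, List, Sequence


def optimal_fair_policy(mu: Sequence[float], L: int,
                        f: Callable[[float], float]) -> List[float]:
    """Return p_a = L * f(mu_a) / sum_a' f(mu_a') for every arm a."""
    merits = [f(m) for m in mu]
    if any(v <= 0 for v in merits):
        raise ValueError("the merit function must be strictly positive")
    total = sum(merits)
    probs = [L * v / total for v in merits]
    # Assumption: each value is a selection probability, so it cannot exceed 1.
    if max(probs) > 1:
        raise ValueError("infeasible: some selection probability exceeds 1")
    return probs


mu = [0.2, 0.4, 0.6, 0.8]  # hypothetical mean rewards


def merit(x: float) -> float:
    return 1.0 + x         # hypothetical merit function, positive on [0, 1]


p_star = optimal_fair_policy(mu, L=2, f=merit)
print([round(p, 4) for p in p_star])
```

With these inputs the probabilities sum to $L = 2$, and the ratio $p_a^* / f(\mu_a)$ is the same for every arm, which is the proportionality that the merit-based policy expresses.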

Figures (9)

Figure 1: Comparison of different bandit algorithms under geometric feedback delays.
Figure 2: Experiment results of different bandit algorithms under different types of feedback delays.
Figure 3: Experiment results using the real-world conversion log dataset.
Figure 4: Experiment results of the different bandit algorithms using different merit functions under fixed feedback delays (200 rounds).
Figure 5: Experiment results of the different bandit algorithms using different merit functions under geometric feedback delays.
...and 4 more figures

Theorems & Definitions (31)

Theorem 3
Theorem 4
Remark 5
Theorem 6
Theorem 7
Theorem 8
Theorem 9
Remark 10
proof
proof
...and 21 more
