Distributed Multi-Task Learning for Stochastic Bandits with Context Distribution and Stage-wise Constraints

Jiabin Lin; Shana Moothedath

TL;DR

This work tackles distributed multi-task stochastic linear contextual bandits where exact contexts are hidden and only context distributions are available, under per-round safety constraints. It introduces DiSC-UCB, a conservative, UCB-based algorithm that prunes unsafe actions using context-distribution features and synchronizes estimates via a central server, achieving a high-probability regret of $\widetilde{O}(d\sqrt{MT})$ and communication cost $O(M^{1.5}d^3)$. The paper further extends to the unknown-baseline-reward setting with DiSC-UCB-UB, preserving the same regret/communication order, and validates the approach on synthetic data and real-world Movielens-100K data. These results demonstrate the practicality of safe, distributed learning across related tasks and identify concrete metrics for performance and communication efficiency in context-distribution-driven bandit problems.

Abstract

We present conservative distributed multi-task learning in stochastic linear contextual bandits with heterogeneous agents. This extends conservative linear bandits to a distributed setting where $M$ agents tackle different but related tasks while adhering to stage-wise performance constraints. The exact context is unknown, and only a context distribution is available to the agents, as in many practical applications where a prediction mechanism infers the context, such as stock market prediction and weather forecasting. We propose a distributed upper confidence bound (UCB) algorithm, DiSC-UCB. In each round, the algorithm constructs a pruned action set to ensure that the constraints are met. It also synchronizes the agents' estimates through a central server using well-structured synchronization steps. We prove regret and communication bounds for the algorithm. We then extend the problem to a setting where the agents do not know the baseline reward. For this setting, we provide a modified algorithm, DiSC-UCB2, and show that it achieves the same regret and communication bounds. We empirically validate the performance of our algorithm on synthetic data and real-world Movielens-100K data.
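The pruning-then-UCB step described in the abstract can be illustrated with a minimal sketch. It is not the paper's pseudocode: it assumes a stage-wise constraint of the form "expected reward at least $(1-\alpha)$ times a known baseline reward", and the names `select_action`, `psi`, `V`, `b`, and `beta` are hypothetical. Here `psi[a]` stands for the feature vector of action `a` averaged over the context distribution.

```python
import numpy as np

def select_action(psi, V, b, beta, baseline_reward, alpha):
    """One conservative UCB step on expected features (illustrative only).

    psi: (K, d) array; psi[a] is the expected feature of action a under the
         context distribution (assumed to be given).
    V, b: regularized Gram matrix (d, d) and reward-weighted feature sum (d,).
    beta: confidence radius.
    baseline_reward, alpha: assumed constraint is
         reward >= (1 - alpha) * baseline_reward.
    Returns the chosen action index, or None if no action is provably safe.
    """
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ b                      # ridge estimate of theta*
    mean = psi @ theta_hat                     # estimated expected rewards
    # width[a] = beta * ||psi[a]||_{V^{-1}}
    width = beta * np.sqrt(np.einsum("kd,de,ke->k", psi, V_inv, psi))
    # Pruned action set: keep actions whose lower confidence bound
    # already satisfies the assumed constraint.
    safe = (mean - width) >= (1.0 - alpha) * baseline_reward
    if not safe.any():
        return None  # caller would fall back to a conservative play
    ucb = np.where(safe, mean + width, -np.inf)
    return int(np.argmax(ucb))

# Toy usage with three actions in d = 2.
psi = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = 100.0 * np.eye(2)
b = V @ np.array([0.9, 0.4])
choice = select_action(psi, V, b, beta=1.0, baseline_reward=0.65, alpha=0.25)
```

Returning `None` when the pruned set is empty keeps the fallback policy out of the sketch; how the fallback is chosen is left to the caller and is not specified here.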

Paper Structure (21 sections, 18 theorems, 82 equations, 6 figures, 1 algorithm)

Introduction
Our Contributions
Related Work
Stochastic Linear Contextual Bandits
Problem Formulation and Notation
Distributed Stage-wise Contextual Bandits with Context Distribution
Proposed Algorithm
Theoretical Analysis on Safety Guarantees
Regret Analysis
Unknown Baseline Reward
Numerical Experiments
Datasets
Comparison of DiSC-UCB with Existing Constrained and Distributed Approaches
Regret versus System Parameters
Conclusion
...and 6 more sections

Key Result

Lemma 1

For any $\delta > 0$, with probability at least $1 - M\delta$, $\theta^\star$ lies in the confidence set $\mathcal{B}_{t,i}$ defined by Eq. eq:conf, where $\beta_{t,i} = \beta_{t,i}(\sqrt{1 + \sigma^2}, \delta/2)$, for all values of $t$ and $i$.
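To make the statement concrete, the sketch below checks membership in an ellipsoidal confidence set around a ridge-regression estimate. The definition of $\mathcal{B}_{t,i}$ is not reproduced in this summary, so the form $\{\theta : \|\theta - \hat\theta\|_{V} \le \beta\}$ used here is an assumption (it is the usual choice in linear bandits), and the helper names `ridge_estimate` and `in_confidence_set` are hypothetical. The radius `beta` is passed in directly rather than computed from $\sqrt{1+\sigma^2}$ and $\delta/2$.

```python
import numpy as np

def ridge_estimate(X, y, lam):
    """Regularized least squares: returns (theta_hat, V) with V = lam*I + X^T X."""
    d = X.shape[1]
    V = lam * np.eye(d) + X.T @ X
    theta_hat = np.linalg.solve(V, X.T @ y)
    return theta_hat, V

def in_confidence_set(theta, theta_hat, V, beta):
    """True if ||theta - theta_hat||_V <= beta (assumed ellipsoidal set)."""
    diff = theta - theta_hat
    return bool(np.sqrt(diff @ V @ diff) <= beta)

# Noise-free toy data: the only estimation error comes from regularization.
rng = np.random.default_rng(0)
theta_star = np.array([0.9, 0.4])
X = rng.normal(size=(50, 2))
y = X @ theta_star
theta_hat, V = ridge_estimate(X, y, lam=1.0)
inside = in_confidence_set(theta_star, theta_hat, V, beta=1.0)
```

In this noise-free case the weighted error is at most $\sqrt{\lambda}\,\|\theta^\star\|_2$, so a radius of 1.0 is enough for the toy parameters; with noisy rewards the radius would have to grow as the lemma's $\beta_{t,i}$ does.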

Figures (6)

Figure 1: Comparison of cumulative regret and cumulative violation of DiSC-UCB with SCLTS [moradipari2020stage], DisLinUCB [wang2019distributed], DisLSB [Jiabin_Shana_ACC], and Fed-PE [huang2021federated] modified for unknown context (Fed-PECD) using [lin2023federated]. Synthetic data: In Figs. \ref{fig:1}, \ref{fig:2}, we set the parameters as $\lambda=1$, $d=2$, $R=1$, $K=40$, $M=1$, $\theta^\star = [0.9, 0.4]$, and noise variance $=2.5 \times 10^{-3}$, and the baseline action is set to the 10$^{\rm th}$ best action. In Figs. \ref{fig:1b}, \ref{fig:2b}, we set the parameters as $\lambda=0.1$, $R=0.1$, $d=2$, number of contexts $|\mathcal{C}|=100$, number of actions $K=10$, and number of agents $M=3$. We add noise with mean $0$ and variance $0.01$ to obtain $\psi$ from $\phi$. The true parameters are $\theta_1^\star= [0.9, 0.4]$, $\theta_2^\star= [0.9, 0]$, and $\theta_3^\star= [0, 0.4]$. The baseline is the 2nd best action, and $\alpha=0.25$. All plots are averaged over 100 independent trials.
Figure 2: Synthetic data: In Fig. \ref{fig:7}, we set the parameters as $\lambda=1$, $d=2$, $R=1$, $K=40$, $M=1$, $\theta^\star = [0.9, 0.4]$, and noise variance $=2.5 \times 10^{-3}$, and the baseline action is set to the 10$^{\rm th}$ best action. In Fig. \ref{fig:8}, the parameters are set as $R=0.1$, $K=90$, $M=3$, $\theta^\star = [1, 1]$, noise variance $=10^{-4}$, and the baseline action of a particular round is set to the 80$^{\rm th}$ best action of that round. The $\alpha$ values are varied as $\alpha \in \{0.1, 0.3, 0.5\}$. In Fig. \ref{fig:3}, $R=1$, $K=90$, $M \in \{3,5,10\}$, $\theta^\star = [1, 1]$, the reward parameters for the different tasks are $\theta^\star_i \in \Theta=\{[1, 1], [1, 0],[0, 1]\}$, noise variance $=10^{-2}$, and the baseline is the 30$^{\rm th}$ best action. Movielens data: In Figs. \ref{fig:4}, \ref{fig:6}, and \ref{fig:5}, $R=0.1$, $K=50$, noise variance $=10^{-2}$, $\theta^\star = \frac{1}{\sqrt{3}}[1, 0, 0, 0, 1, 0, 0, 0, 1]$, and the baseline action is set to the 40$^{\rm th}$ best action. LastFM data: In Figs. \ref{fig:2_a}, \ref{fig:2_b}, \ref{fig:2_c}, $R=\lambda=0.05$, $K=50$, noise variance $=10^{-3}$, $\theta^\star = \frac{1}{\sqrt{3}}[1, 0, 0, 0, 1, 0, 0, 0, 1]$, and the baseline action is set to the 5$^{\rm th}$ best action.
Figure 3: An example demonstrating how an action $x_1"$ that is unsafe under the true context appears safe under a noisy context observation, leading to incorrect conclusions about safe actions.
Figure 4: Figures \ref{fig:11} and \ref{fig:9}: Comparison of cumulative regret of DiSC-UCB with DiSC-UCB-UB for the synthetic and Movielens datasets. Synthetic data: In Fig. \ref{fig:11}, we set $R=0.1$, $K=90$, $M=1$, $\theta^\star = [1, 1]$, noise variance $=10^{-2}$, and the baseline is the 80$^{\rm th}$ best action. Movielens data: In Fig. \ref{fig:9}, $R=0.1$, $K=50$, noise variance $=10^{-2}$, $\theta^\star = \frac{1}{\sqrt{3}}[1, 0, 0, 0, 1, 0, 0, 0, 1]$, and the baseline action is set to the 45$^{\rm th}$ best action. Figure \ref{compare_noise}: Cumulative regret versus round for the three noise models given in Section \ref{sec:noise_comp}.
...and 2 more figures

Theorems & Definitions (36)

Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
Lemma 4
proof
Remark
Lemma 5
...and 26 more
