Thompson Sampling for Multi-Objective Linear Contextual Bandit

Somangchan Park, Heesang Ann, Min-hwan Oh

TL;DR

The paper addresses multi-objective linear contextual bandits where trade-offs across conflicting objectives must be managed. It introduces MOL-TS, a Thompson Sampling-based algorithm that operates on an effective Pareto front, avoiding explicit per-round Pareto front computations while delivering Pareto regret guarantees. The authors prove a worst-case regret bound of $\tilde{O}(d^{3/2}\sqrt{T})$ (up to logarithmic factors in the number of objectives and samples) and validate the approach with empirical results showing improved regret and multi-objective performance. The work advances randomized methods in multi-objective bandits by introducing the effective Pareto optimality concept and providing theoretical guarantees alongside practical effectiveness.

Abstract

We study the multi-objective linear contextual bandit problem, where multiple possibly conflicting objectives must be optimized simultaneously. We propose \texttt{MOL-TS}, the \textit{first} Thompson Sampling algorithm with Pareto regret guarantees for this problem. Unlike standard approaches that compute an empirical Pareto front each round, \texttt{MOL-TS} samples parameters across objectives and efficiently selects an arm from a novel \emph{effective Pareto front}, which accounts for repeated selections over time. Our analysis shows that \texttt{MOL-TS} achieves a worst-case Pareto regret bound of $\widetilde{O}(d^{3/2}\sqrt{T})$, where $d$ is the dimension of the feature vectors and $T$ is the total number of rounds, matching the best known order for randomized linear bandit algorithms in the single-objective setting. Empirical results confirm the benefits of our proposed approach, demonstrating improved regret minimization and strong multi-objective performance.
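To make the sampling step concrete, here is a minimal Python sketch of one round of a simplified multi-objective linear Thompson Sampling procedure. It is not the authors' MOL-TS. The per-objective ridge estimates, the Gaussian sampling scale `v`, the optimistic maximum over `M` samples, and the random weight vector used to pick an arm are all assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np


def mo_lin_ts_round(X, V, B, v, M, rng):
    """Pick an arm in one round of a simplified multi-objective linear TS.

    X: (K, d) feature vectors of the K arms in this round.
    V: (d, d) regularized Gram matrix of previously played features.
    B: (d, L) matrix whose column l accumulates x * (reward in objective l).
    v: sampling scale (assumed tuning parameter).
    M: number of posterior samples per objective (assumed).
    """
    K, d = X.shape
    L = B.shape[1]
    V_inv = np.linalg.inv(V)
    theta_hat = V_inv @ B                 # ridge estimate, one column per objective
    cov = v ** 2 * V_inv
    cov = (cov + cov.T) / 2               # symmetrize against round-off

    est = np.empty((K, L))
    for l in range(L):
        # Draw M parameter samples for objective l and keep, for each arm,
        # the largest sampled reward (an optimistic estimate).
        samples = rng.multivariate_normal(theta_hat[:, l], cov, size=M)  # (M, d)
        est[:, l] = (X @ samples.T).max(axis=1)

    # Assumption: choose the arm maximizing a randomly weighted sum of the
    # sampled rewards, with the weight drawn uniformly from the simplex.
    w = rng.dirichlet(np.ones(L))
    return int(np.argmax(est @ w))


def update(V, B, x, r):
    """Rank-one update after playing feature x and observing reward vector r."""
    return V + np.outer(x, x), B + np.outer(x, r)
```

Maximizing a weighted sum of the sampled rewards avoids enumerating a Pareto front in each round, which is the computational saving the abstract highlights. How MOL-TS itself selects within the effective Pareto front is specified in the paper's algorithm.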

Paper Structure

This paper contains 23 sections, 14 theorems, 82 equations, 13 figures, 1 algorithm.

Key Result

Theorem 1

For any $a_*\in\mathcal{C}^*$, there exists $\boldsymbol{w}\in {\mathcal{S}}^L$ satisfying $a_* = \arg\max_{a\in\mathcal{A}} \boldsymbol{w}^\top\boldsymbol{\mu}_a$. Conversely, for any $\boldsymbol{w}\in {\mathcal{S}}^L$, if $a_* = \arg\max_{a\in\mathcal{A}} \boldsymbol{w}^\top\boldsymbol{\mu}_a$ is
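The first half of the statement ties each arm in $\mathcal{C}^*$ to a weight vector for which it maximizes the weighted sum $\boldsymbol{w}^\top\boldsymbol{\mu}_a$. The sketch below illustrates that scalarization on hypothetical mean reward vectors with two objectives. The arm values, the weight grid, and the function name are assumptions chosen for illustration, not taken from the paper.

```python
import numpy as np

# Hypothetical mean reward vectors (rows: arms, columns: two objectives).
mu = np.array([
    [1.0, 0.0],   # arm 0
    [0.0, 1.0],   # arm 1
    [0.4, 0.4],   # arm 2: not dominated by any single arm
    [0.2, 0.2],   # arm 3: dominated by arm 2
])


def weighted_sum_maximizers(mu, n_grid=101):
    """Arms that attain max_a w^T mu_a for at least one weight w on a grid."""
    winners = set()
    for w1 in np.linspace(0.0, 1.0, n_grid):
        w = np.array([w1, 1.0 - w1])      # weight vector on the simplex
        scores = mu @ w
        best = np.flatnonzero(np.isclose(scores, scores.max()))
        winners.update(best.tolist())
    return winners


print(sorted(weighted_sum_maximizers(mu)))  # prints [0, 1]
```

Arm 2 is Pareto optimal in the usual sense, yet no weight vector selects it: its weighted sum is always 0.4, while the better of arms 0 and 1 scores at least 0.5. Playing arms 0 and 1 equally often yields an average reward of (0.5, 0.5), which exceeds arm 2 in both objectives. This is one way to read the abstract's remark that the effective Pareto front accounts for repeated selections over time.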

Figures (13)

  • Figure 1: Example of two objectives and four arms, $a_{(1)}, a_{(2)}, a_{(3)}$, and $a_{(4)}$. Each subplot shows the mean reward vectors at round $t$, where the horizontal and vertical axes correspond to the first and second objective, respectively. Red circles represent Pareto optimal arms, and blue triangles represent arms that are not Pareto optimal. The mean reward vectors are listed on the right, and the pink line represents the boundary of the effective Pareto front (see Definition 5).
  • Figure 2: Experimental results with $4$ objectives ($L=4$). Plots in the left three columns compare the performance of MOL-TS with that of the other algorithms. The two plots in the first column show the Pareto regret and the effective Pareto regret. The four plots in the second and third columns show the cumulative reward for each objective. Plots in the rightmost column compare the performance of MOL-TS with $M = 1$ and $M = O(\log L)$. The error bars represent one standard deviation over $10$ instances.
  • Figure 3: Experimental results with $K = 50,~ d=10,~L=4$
  • Figure 4: Experimental results with $K = 100,~ d=5,~L=4$
  • Figure 5: Experimental results with $K = 100,~ d=10,~L=4$
  • ...and 8 more figures

Theorems & Definitions (23)

  • Definition 1: Pareto order
  • Definition 2: Pareto optimal arm
  • Definition 3: Pareto sub-optimality gap
  • Definition 4: Pareto regret
  • Definition 5: Effective Pareto optimal arm
  • Theorem 1
  • Definition 6: Effective Pareto sub-optimality gap
  • Definition 7: Effective Pareto regret
  • Lemma 1: Optimistic Sampling
  • Theorem 2: Effective Pareto regret of MOL-TS
  • ...and 13 more
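
The definitions are listed here by name only. As a rough guide, the sketch below implements the Pareto order, the Pareto optimal set, and a Pareto sub-optimality gap as they are commonly defined in the multi-objective bandit literature. These forms are assumed rather than copied from the paper, whose exact statements may differ, and the mean reward vectors and function names are hypothetical.

```python
import numpy as np


def dominates(u, v):
    """Pareto order (assumed form): u dominates v if u >= v in every
    objective and u > v in at least one objective."""
    return bool(np.all(u >= v) and np.any(u > v))


def pareto_optimal_arms(mu):
    """Indices of arms whose mean vector is dominated by no other arm."""
    K = mu.shape[0]
    return [a for a in range(K)
            if not any(dominates(mu[b], mu[a]) for b in range(K) if b != a)]


def pareto_gap(mu, a):
    """Assumed gap: the smallest eps >= 0 such that mu[a] + eps (added to
    every objective) is no longer dominated by any arm."""
    # For each arm b, the smallest per-objective margin of b over a.
    margins = (mu - mu[a]).min(axis=1)
    return float(max(0.0, margins.max()))


# Hypothetical mean reward vectors for four arms and two objectives.
mu = np.array([
    [0.9, 0.1],
    [0.1, 0.9],
    [0.5, 0.3],
    [0.3, 0.2],   # dominated by the arm above it
])

print(pareto_optimal_arms(mu))  # prints [0, 1, 2]
```

Under these assumed definitions, every Pareto optimal arm has gap zero, and the dominated arm has a gap of roughly 0.1, the amount by which both of its objectives must be raised before the third arm stops dominating it.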