Sequential Learning of the Pareto Front for Multi-objective Bandits
Elise Crépon, Aurélien Garivier, Wouter M Koolen
TL;DR
The paper targets fast, fixed-confidence identification of the Pareto front in multi-objective bandits with $K$ arms and rewards in $\mathbb{R}^d$, aiming for success probability at least $1-\delta$. It adapts the Track-and-Stop framework to Pareto-front identification and optimizes the inner problem of the information-theoretic lower bound via online gradient ascent, focusing on Gaussian arms with identity covariance. A key contribution is the decomposition of the transport cost into two cases, removing a point from the Pareto set or adding one to it, together with a cell-based algorithm that achieves instance-optimal sample complexity at a per-round cost of $O(K p^d)$, where $p$ is the number of Pareto-optimal arms. The authors derive explicit costs for removal and develop a tractable procedure for addition, including a cell enumeration strategy with a combinatorial bound $\binom{p+d-1}{d-1}$. Empirical results on real and synthetic data demonstrate substantial improvements in sample efficiency and provide practical insights into scalability for moderate $p$ and $d$.
Abstract
We study the problem of sequential learning of the Pareto front in multi-objective multi-armed bandits. An agent is faced with $K$ possible arms to pull. At each round, she picks one arm and receives a vector-valued reward. When she thinks she has enough information to identify the Pareto front of the arm means, she stops the game and gives an answer. We are interested in designing algorithms such that the answer given is correct with probability at least $1-\delta$. Our main contribution is an efficient implementation of an algorithm achieving the optimal sample complexity when the risk $\delta$ is small. With $K$ arms in $d$ dimensions, $p$ of which are in the Pareto set, the algorithm runs in time $O(Kp^d)$ per round.
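
To make the target of the identification problem concrete, the following minimal sketch computes the Pareto set of a known matrix of arm means. It is an illustration only, not the algorithm of the paper: the function names `dominates` and `pareto_front` and the example means are hypothetical, and the sketch assumes that larger is better in every objective and that an arm is dominated when another arm is at least as good in every coordinate and strictly better in at least one.

```python
def dominates(a, b):
    """Return True if mean vector a dominates mean vector b.

    Assumed convention: a is at least as large as b in every objective
    and strictly larger in at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(means):
    """Return the indices of the arms that no other arm dominates.

    `means` is a list of K mean vectors, each of length d.
    This brute-force check compares every pair of arms, so it uses
    O(K^2 d) comparisons.
    """
    return [
        i
        for i, m in enumerate(means)
        if not any(dominates(other, m) for other in means)
    ]


# Hypothetical instance with K = 4 arms in d = 2 dimensions.
example_means = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.5],
    [0.2, 0.2],  # dominated by [0.5, 0.5]
]
front = pareto_front(example_means)
print(front)  # [0, 1, 2]
```

In this instance $p = 3$ of the $K = 4$ arms are Pareto optimal. In the bandit setting the means are unknown, so the agent must collect enough samples to be confident, at risk level $\delta$, that the Pareto set of the true means is the one it reports.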
