Learning the Pareto Front Using Bootstrapped Observation Samples

Wonyoung Kim, Garud Iyengar, Assaf Zeevi

TL;DR

This work tackles Pareto front identification in linear contextual bandits (PFILin) by introducing PFIwR, an algorithm that attains near-optimal sample complexity (up to polylog factors) and near-optimal Pareto regret. It hinges on two innovations: (i) an exploration-mixed estimator that updates rewards along multiple context directions by recycling exploration samples via a context-basis representation, and (ii) a doubly robust (DR) estimator that imputes missing rewards to maintain unbiased learning for all arms. By reducing the arm-reward learning problem to a small set of context-basis rewards and coupling it with the DR scheme, PFIwR enables efficient identification of the Pareto front even when the number of arms is exponentially large. Theoretical results show $\tilde{O}(\text{something like } \theta_{\max} d^{3} L/(\text{gap})^2)$-type sample complexity and regret bounds, and experiments demonstrate effective convergence on all arms and superior performance over prior methods such as MultiPFI in both identification and Pareto-regret minimization. This approach has practical impact for multi-objective online decision-making with linear context models, including medical decision support and recommender systems, where identifying all potentially optimal actions with constrained sampling and controlled regret is crucial.

Abstract

We consider Pareto front identification (PFI) for linear bandits (PFILin), i.e., the goal is to identify a set of arms with undominated mean reward vectors when the mean reward vector is a linear function of the context. PFILin includes the best arm identification problem and multi-objective active learning as special cases. The sample complexity of our proposed algorithm is optimal up to a logarithmic factor. In addition, the regret incurred by our algorithm during the estimation is within a logarithmic factor of the optimal regret among all algorithms that identify the Pareto front. Our key contribution is a new estimator that in every round updates the estimate for the unknown parameter along multiple context directions -- in contrast to the conventional estimator that only updates the parameter estimate along the chosen context. This allows us to use low-regret arms to collect information about Pareto optimal arms. Our key innovation is to reuse the exploration samples multiple times; in contrast to conventional estimators that use each sample only once. Numerical experiments demonstrate that the proposed algorithm successfully identifies the Pareto front while controlling the regret.
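
As a rough illustration of the object being identified, the sketch below computes the set of undominated arms when the mean reward vectors are known exactly. The contexts `X`, the parameter matrix `Theta`, and the helper `pareto_front` are hypothetical, and the dominance convention (no worse in every objective, strictly better in at least one) is a common one assumed here rather than taken from the paper. The paper's algorithm must instead estimate these means from noisy samples.

```python
import numpy as np

def pareto_front(means):
    """Return indices of arms whose mean reward vectors are undominated.

    Assumed convention: arm j dominates arm i if means[j] >= means[i]
    in every objective and means[j] > means[i] in at least one.
    """
    front = []
    for i, mu_i in enumerate(means):
        dominated = any(
            np.all(mu_j >= mu_i) and np.any(mu_j > mu_i)
            for j, mu_j in enumerate(means)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Hypothetical instance: K = 4 arms, d = 2 context dimensions, L = 2 objectives.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.2, 0.2]])  # contexts (K x d)
Theta = np.array([[1.0, 0.0], [0.0, 1.0]])  # column l = parameter of objective l (d x L)

means = X @ Theta            # linear mean rewards (K x L)
front = pareto_front(means)  # arm 3 is dominated by arm 2 and is excluded

# With a single objective the front reduces to the best arm.
best = pareto_front(np.array([[0.1], [0.9], [0.5]]))
```

With one objective the undominated set is just the arm with the largest mean, which matches the abstract's remark that best arm identification is a special case.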

Paper Structure (31 sections, 21 theorems, 175 equations, 4 figures, 1 table, 1 algorithm)

This paper contains 31 sections, 21 theorems, 175 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.3

Fix $\epsilon>0$, and let $\Delta_{(k),\epsilon}:=\max\{\Delta_{(k)},\epsilon\}$. Suppose the set of context vectors $\mathcal{X}$ spans $\mathbb{R}^{d}$ and $\|\theta_{\star}^{\langle\ell\rangle}\|_{0} = d$, for all $\ell \in [L]$. Then, for any $\delta\in(0,1/4)$ and $\sigma>0$, there exists a …

Figures (4)

  • Figure 1: Estimation errors of the proposed DR-mix estimator (\ref{eq:estimator}) compared with the conventional ridge estimator and the exploration-mixed estimator (\ref{eq:mixup}) for a 3-armed bandit problem. The line and shade represent the average and standard deviation over 1000 independent experiments. The estimators use samples from all arms for $n\in[50]$, and after that, only observe rewards from arm $1$.
  • Figure 2: Comparison of PFIwR and MultiPFI on the SW-LLVM dataset. Both algorithms correctly identify the $\epsilon$-near Pareto optimal arms in all 500 independent experiments.
  • Figure 3: The $\ell_2$-error of the reward on the unexploited arms (arms 2 and 3) for the proposed estimator and for the DR estimator whose imputation estimator is the conventional ridge estimator, in the 3-armed bandit problem (for the detailed setting, see Section \ref{subsec:experiment_estimator}). The estimators use samples from all three arms when $t\le50$ and only arm 1 when $t>50$. When constructing a DR estimator, choosing an imputation estimator that learns rewards on all arms is crucial for convergence on all arms.
  • Figure 4: Changes in densities of $\sqrt{n}(\widehat{\theta}-\theta_{\star})$ over the number of samples $n=50, 500, 2000$ on the exploited arm (arm 1) and the unexploited arm (arm 2). The vertical line represents the average computed from 1000 independent experiments. The proposed DR-mix estimator converges faster with lower variance than the ridge and exploration-mixed estimators on all arms.
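
The sampling pattern in the captions for Figures 1 and 3 (all arms up to round 50, then only arm 1) can be mimicked with a conventional ridge estimator to see why it stalls on the unexploited arms. The sketch below is not the paper's experimental setup: it assumes each arm's context is a standard basis vector, and `theta_star`, `noise_sd`, and `lam` are made-up values. It covers only the ridge baseline, not the exploration-mixed or DR-mix estimators.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
contexts = np.eye(d)                      # assumption: arm k has context e_k
theta_star = np.array([1.0, -0.5, 0.3])   # hypothetical true parameter
noise_sd = 0.1                            # hypothetical noise level
lam = 1.0                                 # ridge regularization weight

A = lam * np.eye(d)   # regularized Gram matrix
b = np.zeros(d)       # running sum of reward-weighted contexts
snapshots = {}

for t in range(1, 201):
    # Cycle through all arms for t <= 50, then pull only arm 1 (index 0).
    arm = (t - 1) % 3 if t <= 50 else 0
    x = contexts[arm]
    y = x @ theta_star + noise_sd * rng.normal()
    A += np.outer(x, x)   # only the direction of the chosen context changes
    b += y * x
    if t in (50, 200):
        snapshots[t] = np.linalg.solve(A, b)

est_50, est_200 = snapshots[50], snapshots[200]
# Coordinates tied to arms 2 and 3 receive no updates after round 50.
stalled = np.allclose(est_50[1:], est_200[1:])
```

Under these assumptions the estimate for the exploited arm keeps improving, while the coordinates for arms 2 and 3 stay frozen at their round-50 values, which is the behaviour of a conventional estimator that updates only along the chosen context.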

Theorems & Definitions (38)

  • Definition 3.1: Pareto Front
  • Definition 3.2: PFI success condition
  • Theorem 3.3: A lower bound of the sample complexity for PFILin.
  • Lemma 4.1
  • Theorem 4.2: Estimation error bound for the DR-mix estimator
  • Theorem 5.1: An upper bound on sample complexity
  • Theorem 5.2: Upper bounds on Pareto regret
  • Theorem 5.3: A regret lower bound for PFILin
  • Lemma B.1
  • proof
  • ...and 28 more