Beyond the Lower Bound: Bridging Regret Minimization and Best Arm Identification in Lexicographic Bandits
Bo Xue, Yuanyu Wan, Zhichao Lu, Qingfu Zhang
TL;DR
This work introduces a unified framework for regret minimization (RM) and best arm identification (BAI) in lexicographic (priority-based) multi-objective bandits. It develops two elimination-based algorithms, LexElim-Out and LexElim-In, that respect the lexicographic priorities while leveraging cross-objective rewards to accelerate learning. Theoretical guarantees show that LexElim-Out matches the best-known instance-dependent BAI bounds for the top objective, while LexElim-In achieves faster, cross-objective-aware rates, including minimax regret scaling of $\widetilde{O}(\Lambda^i(\lambda)\sqrt{Kt})$ for each objective. Empirical results on synthetic data show that both algorithms outperform the baselines, with LexElim-In benefiting in particular from information sharing across objectives. These results highlight the practical value of jointly optimizing RM and BAI in structured, multi-objective decision problems.
Abstract
In multi-objective decision-making with hierarchical preferences, lexicographic bandits provide a natural framework for optimizing multiple objectives in a prioritized order. In this setting, a learner repeatedly selects arms and observes reward vectors, aiming to maximize the reward for the highest-priority objective, then the next, and so on. While previous studies have primarily focused on regret minimization, this work bridges the gap between \textit{regret minimization} and \textit{best arm identification} under lexicographic preferences. We propose two elimination-based algorithms to address this joint objective. The first algorithm eliminates suboptimal arms sequentially, layer by layer, in accordance with the objective priorities, and achieves sample complexity and regret bounds comparable to those of the best single-objective algorithms. The second algorithm leverages reward information from all objectives simultaneously in each round, effectively exploiting cross-objective dependencies. Remarkably, its guarantees go beyond the known lower bound for the single-objective bandit problem, highlighting the benefit of cross-objective information sharing in the multi-objective setting. Empirical results further validate the superior performance of both algorithms over baselines.
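The lexicographic ordering described above can be illustrated with a minimal sketch. It assumes the mean reward vectors are known exactly, which is not the bandit setting itself (there the means must be estimated from samples), and it is not the authors' LexElim-Out or LexElim-In. The function name `lexicographic_best_arms`, the `tol` parameter, and the example values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def lexicographic_best_arms(
    means: Sequence[Sequence[float]], tol: float = 0.0
) -> List[int]:
    """Return the indices of the lexicographically optimal arms.

    means[a][i] is the (assumed known) mean reward of arm a on
    objective i, where objective 0 has the highest priority.
    tol is a hypothetical slack for treating near-ties as ties.
    """
    if not means:
        return []
    candidates = list(range(len(means)))
    num_objectives = len(means[0])
    # Filter layer by layer: keep only the arms that are best on the
    # current objective among those that survived all earlier objectives.
    for i in range(num_objectives):
        best = max(means[a][i] for a in candidates)
        candidates = [a for a in candidates if means[a][i] >= best - tol]
        if len(candidates) == 1:
            break
    return candidates


# Hypothetical instance with four arms and three objectives.
example_means = [
    [0.9, 0.2, 0.5],
    [0.9, 0.7, 0.1],
    [0.9, 0.7, 0.4],
    [0.5, 1.0, 1.0],
]
print(lexicographic_best_arms(example_means))  # prints [2]
```

In this example, arm 3 has the largest rewards on the second and third objectives but is discarded at the first layer, because lower-priority objectives only break ties among arms that are optimal on every higher-priority objective.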
