Nearly Minimax Optimal Submodular Maximization with Bandit Feedback
Artin Tajdini, Lalit Jain, Kevin Jamieson
TL;DR
We study maximizing an unknown monotone submodular function $f$ under a cardinality constraint with stochastic bandit feedback. The paper proves minimax lower bounds on the robust greedy regret $R_{gr}$ and introduces Sub-UCB, an algorithm that interpolates between greedy exploration and UCB over full sets. The main results are a lower bound $R_{gr} = \tilde{\Omega}(\min_{0\le L\le k}(L^{1/3}n^{1/3}T^{2/3} + \sqrt{{n \choose k - L}T}))$ for general algorithms, a stronger lower bound $R_{gr} = \tilde{\Omega}(\min_{0\le L\le k}(L n^{1/3}T^{2/3} + \sqrt{{n \choose k - L}T}))$ for a slightly restricted algorithm class, and an upper bound $R_{gr} = \tilde{\mathcal{O}}(\min_{0\le L\le k}(L n^{1/3}T^{2/3} + \sqrt{{n \choose k - L}T}))$ for Sub-UCB. Sub-UCB therefore matches the lower bound for the restricted class up to logarithmic factors. These are the first minimax lower bounds for this setting, and they show how to balance partial greedy growth against exploration over the remaining elements.
Abstract
We consider maximizing an unknown monotone submodular set function $f: 2^{[n]} \rightarrow [0,1]$ under a cardinality constraint with stochastic bandit feedback. At each time $t=1,\dots,T$ the learner chooses a set $S_t \subset [n]$ with $|S_t| \leq k$ and receives reward $f(S_t) + \eta_t$, where $\eta_t$ is mean-zero sub-Gaussian noise. The objective is to minimize the learner's regret with respect to an approximation of the maximum $f(S_*)$ with $|S_*| = k$, obtained through robust greedy maximization of $f$. To date, the best regret bound in the literature scales as $k n^{1/3} T^{2/3}$. Moreover, by trivially treating every set as a separate arm, one deduces that $\sqrt{ {n \choose k} T }$ is also achievable using standard multi-armed bandit algorithms. In this work, we establish the first minimax lower bound for this setting, which scales like $\tilde{\Omega}(\min_{L \le k}(L^{1/3}n^{1/3}T^{2/3} + \sqrt{{n \choose k - L}T}))$. For a slightly restricted algorithm class, we prove a stronger regret lower bound of $\tilde{\Omega}(\min_{L \le k}(Ln^{1/3}T^{2/3} + \sqrt{{n \choose k - L}T}))$. Finally, we propose an algorithm, Sub-UCB, that achieves regret $\tilde{\mathcal{O}}(\min_{L \le k}(Ln^{1/3}T^{2/3} + \sqrt{{n \choose k - L}T}))$, matching the lower bound for the restricted class up to logarithmic factors.
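To get a feel for the trade-off inside the $\min_{L \le k}$, the following minimal Python sketch evaluates the two terms of the rate for each $L$ and reports the minimizer. It is only an illustration of the formula, not the authors' algorithm: it drops all constants and logarithmic factors, assumes $L$ ranges over $0,\dots,k$, and the function names (`greedy_term`, `ucb_term`, `best_interpolation`) are hypothetical. Setting `exponent=1/3` gives the shape of the general lower bound; `exponent=1` gives the shape of the Sub-UCB upper bound.

```python
from math import comb, sqrt


def greedy_term(L, n, T, exponent=1.0):
    """First term of the rate: L^exponent * n^(1/3) * T^(2/3), constants dropped."""
    return (L ** exponent) * n ** (1 / 3) * T ** (2 / 3)


def ucb_term(L, n, k, T):
    """Second term of the rate: sqrt(C(n, k - L) * T), constants dropped."""
    return sqrt(comb(n, k - L) * T)


def best_interpolation(n, k, T, exponent=1.0):
    """Return (L, value) minimizing the sum of both terms over 0 <= L <= k.

    Assumption: L ranges over the integers 0..k; log factors are ignored.
    """
    candidates = [
        (greedy_term(L, n, T, exponent) + ucb_term(L, n, k, T), L)
        for L in range(k + 1)
    ]
    value, L = min(candidates)
    return L, value


if __name__ == "__main__":
    # Illustrative parameters only; they are not taken from the paper.
    for T in (10**3, 10**6, 10**9):
        L, value = best_interpolation(n=20, k=5, T=T)
        print(f"T={T:>10}: best L={L}, rate ~ {value:.1f}")
```

At $L = 0$ the first term vanishes and the expression reduces to the $\sqrt{{n \choose k} T}$ rate of treating every set as an arm; at $L = k$ the second term is $\sqrt{T}$ and the first term dominates, recovering the $k n^{1/3} T^{2/3}$ rate.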
