Neural Combinatorial Clustered Bandits for Recommendation Systems
Baran Atalar, Carlee Joe-Wong
TL;DR
The paper tackles contextual combinatorial bandits for recommender systems under semi-bandit feedback with unknown reward functions. It introduces NeUClust, which combines two neural networks (one for base-arm rewards, one for the monotone super-arm reward) with online clustering of contexts to guide super-arm selection without requiring an optimization oracle. The authors prove a regret bound of $\widetilde{O}(\widetilde{d}\sqrt{T})$, where $\widetilde{d}$ is the effective dimension of the neural tangent kernel matrix and $T$ is the number of rounds. Experiments on MovieLens and Yelp show better regret and reward than other contextual combinatorial and neural bandit baselines. By removing the oracle requirement and exploiting clustered structure in the context space, the approach is more practical for real-world recommendation.
Abstract
We consider the contextual combinatorial bandit setting where in each round, the learning agent, e.g., a recommender system, selects a subset of "arms," e.g., products, and observes rewards for both the individual base arms, which are a function of known features (called "context"), and the super arm (the subset of arms), which is a function of the base arm rewards. The agent's goal is to simultaneously learn the unknown reward functions and choose the highest-reward arms. For example, the "reward" may represent a user's probability of clicking on one of the recommended products. Conventional bandit models, however, employ restrictive reward function models in order to obtain performance guarantees. We make use of deep neural networks to estimate and learn the unknown reward functions and propose Neural UCB Clustering (NeUClust), which adopts a clustering approach to select the super arm in every round by exploiting underlying structure in the context space. Unlike prior neural bandit works, NeUClust uses a neural network to estimate the super arm reward and select the super arm, thus eliminating the need for a known optimization oracle. We non-trivially extend prior neural combinatorial bandit works to prove that NeUClust achieves $\widetilde{O}\left(\widetilde{d}\sqrt{T}\right)$ regret, where $\widetilde{d}$ is the effective dimension of a neural tangent kernel matrix and $T$ is the number of rounds. Experiments on real-world recommendation datasets show that NeUClust achieves better regret and reward than other contextual combinatorial and neural bandit algorithms.
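To make the interaction protocol described above concrete, the following is a minimal sketch of a single round with semi-bandit feedback. It is not NeUClust: the selection policy is a uniform-random placeholder, and the reward functions are illustrative assumptions (a sigmoid of a linear score for base arms, and the probability of at least one click under independent clicks for the super arm). The names `base_reward`, `super_reward`, and `play_round` are hypothetical.

```python
import math
import random


def base_reward(context, theta):
    """Hypothetical base-arm reward: sigmoid of a linear score (an assumption)."""
    score = sum(c * w for c, w in zip(context, theta))
    return 1.0 / (1.0 + math.exp(-score))


def super_reward(base_rewards):
    """Hypothetical monotone super-arm reward: probability of at least one
    click, assuming independent clicks (an assumption, not the paper's model)."""
    return 1.0 - math.prod(1.0 - r for r in base_rewards)


def play_round(contexts, theta, k, rng):
    """Select k distinct arms and return the semi-bandit feedback.

    The policy here is uniform random sampling, a placeholder for whatever
    learning algorithm chooses the super arm.
    """
    chosen = rng.sample(range(len(contexts)), k)
    r_base = [base_reward(contexts[i], theta) for i in chosen]  # per-arm feedback
    return chosen, r_base, super_reward(r_base)                # plus super-arm feedback


rng = random.Random(0)
contexts = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(20)]  # 20 arms, 5 features
theta = [rng.gauss(0, 1) for _ in range(5)]                          # unknown to the agent
chosen, r_base, r_super = play_round(contexts, theta, k=4, rng=rng)
```

In this toy setup the agent sees `r_base` for each selected arm as well as `r_super` for the whole subset, which is the feedback structure the abstract describes. A learning algorithm would replace the random policy and would have to estimate both reward functions from this feedback.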
