Online Clustering of Dueling Bandits
Zhiyong Wang, Jiahang Sun, Mingze Kong, Jize Xie, Qinghua Hu, John C. S. Lui, Zhongxiang Dai
TL;DR
This work extends contextual bandits by enabling cross-user collaboration through online graph-based clustering of dueling bandits. It introduces two algorithms, COLDB for linear reward models and CONDB for neural reward models, both of which update a clustering graph and share data within the inferred clusters to reduce regret. Theoretical analyses establish sublinear regret bounds that improve as the number of clusters decreases, validating the benefit of collaboration under preference feedback. Empirical results on synthetic data and MovieLens confirm substantial gains over single-user baselines, especially when there are fewer clusters and therefore more shared information. Overall, the paper provides a principled framework and rigorous guarantees for clustering-based collaboration in dueling bandit settings.
Abstract
The contextual multi-armed bandit (MAB) is a widely used framework for problems requiring sequential decision-making under uncertainty, such as recommendation systems. In applications involving a large number of users, the performance of contextual MAB can be significantly improved by facilitating collaboration among multiple users. This has been achieved by clustering of bandits (CB) methods, which adaptively group the users into different clusters and achieve collaboration by allowing the users in the same cluster to share data. However, classical CB algorithms typically rely on numerical reward feedback, which may not be practical in certain real-world applications. For instance, in recommendation systems, it is more realistic and reliable to solicit preference feedback between pairs of recommended items rather than absolute rewards. To address this limitation, we introduce the first "clustering of dueling bandits" algorithms to enable collaborative decision-making based on preference feedback. We propose two novel algorithms: (1) Clustering of Linear Dueling Bandits (COLDB), which models the user reward functions as linear functions of the context vectors, and (2) Clustering of Neural Dueling Bandits (CONDB), which uses a neural network to model complex, non-linear user reward functions. Both algorithms are supported by rigorous theoretical analyses, demonstrating that user collaboration leads to improved regret bounds. Extensive empirical evaluations on synthetic and real-world datasets further validate the effectiveness of our methods, establishing their potential in real-world applications involving multiple users with preference-based feedback.
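To make the idea of adaptively grouping users concrete, the following is a minimal sketch of one common graph-based clustering step, not the paper's actual procedure. It assumes each user has a current estimate of a linear parameter vector, starts from a complete graph over users, deletes an edge when two estimates differ by more than a threshold, and treats connected components as clusters. The names `prune_edges`, `connected_components`, `theta_hat`, and the fixed `threshold` are all hypothetical; a real algorithm would likely use a confidence-based threshold that shrinks as data accumulates.

```python
import numpy as np

def prune_edges(adj, theta_hat, threshold):
    """Delete edges between users whose estimated parameters are far apart.

    adj: (n, n) boolean adjacency matrix (modified in place and returned).
    theta_hat: (n, d) array of per-user parameter estimates (assumed given).
    threshold: hypothetical fixed distance threshold.
    """
    n = adj.shape[0]
    for u in range(n):
        for v in range(u + 1, n):
            if adj[u, v] and np.linalg.norm(theta_hat[u] - theta_hat[v]) > threshold:
                adj[u, v] = False
                adj[v, u] = False
    return adj

def connected_components(adj):
    """Return the clusters as sorted lists of user indices (depth-first search)."""
    n = adj.shape[0]
    seen = [False] * n
    clusters = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, component = [start], []
        while stack:
            u = stack.pop()
            component.append(u)
            for v in range(n):
                if adj[u, v] and not seen[v]:
                    seen[v] = True
                    stack.append(v)
        clusters.append(sorted(component))
    return clusters

# Toy example: users 0 and 1 have similar estimates, as do users 2 and 3.
theta_hat = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
adj = np.ones((4, 4), dtype=bool)  # start from a complete graph
adj = prune_edges(adj, theta_hat, threshold=0.5)
clusters = connected_components(adj)
print(clusters)
```

In this sketch, users in the same inferred cluster would then pool their observations when estimating a shared reward model, which is the sense in which collaboration happens.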
