MUSS: Multilevel Subset Selection for Relevance and Diversity
Vu Nguyen, Andrey Kan
TL;DR
This work tackles the NP-hard problem of selecting a small subset that is both highly relevant and diverse. It introduces MUSS, a multilevel distributed method that clusters the data, selects a subset of clusters, and then performs parallel, within-cluster greedy selection before a final refinement step to produce a size-$k$ subset that optimizes $F(S)=\lambda Q(S)+(1-\lambda) D(S)$. The authors establish a constant-factor approximation framework and prove bounds that tighten the DGDS analysis, while demonstrating strong empirical gains in both product recommendation and RAG QA tasks, including substantial speedups (up to 80x faster than MMR) and production deployment for large-scale candidate retrieval. The combination of clustering-based pruning, parallelizable within-cluster selection, and theoretical guarantees yields a scalable solution with practical impact for large-scale ML systems that require diverse yet relevant recommendations or retrieved items.
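To make the objective $F(S)=\lambda Q(S)+(1-\lambda) D(S)$ concrete, the following is a minimal sketch of greedy selection for it. The summary above does not define $Q$ and $D$, so the sketch assumes one common choice: $Q(S)$ is the sum of per-item relevance scores and $D(S)$ is the sum of pairwise distances within $S$. The names `objective`, `greedy_select`, `q`, and `dist` are illustrative and are not taken from the paper.

```python
import numpy as np

def objective(S, q, dist, lam):
    """F(S) = lam * Q(S) + (1 - lam) * D(S) under the assumed Q and D."""
    S = list(S)
    Q = q[S].sum()                          # assumed: sum of relevance scores
    D = dist[np.ix_(S, S)].sum() / 2.0      # assumed: sum over unordered pairs
    return lam * Q + (1.0 - lam) * D

def greedy_select(q, dist, k, lam):
    """Pick up to k items, each time adding the item with the largest marginal gain."""
    n = len(q)
    selected = []
    div_gain = np.zeros(n)                  # distance from each item to the current selection
    available = np.ones(n, dtype=bool)
    for _ in range(min(k, n)):
        gain = lam * q + (1.0 - lam) * div_gain
        gain[~available] = -np.inf          # never re-pick an item
        best = int(np.argmax(gain))
        selected.append(best)
        available[best] = False
        div_gain += dist[:, best]           # update marginal diversity gains
    return selected
```

With $\lambda=1$ the sketch reduces to picking the top-$k$ items by relevance; smaller values of $\lambda$ increasingly favor items that are far from those already selected.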
Abstract
The problem of relevant and diverse subset selection has a wide range of applications, including recommender systems and retrieval-augmented generation (RAG). For example, in recommender systems, one is interested in selecting relevant items while providing a diversified recommendation. The constrained subset selection problem is NP-hard, and popular approaches such as Maximum Marginal Relevance (MMR) are based on greedy selection. Many real-world applications involve large data, but the original MMR work did not consider distributed selection. This limitation was later addressed by a method called DGDS, which allows for a distributed setting using random data partitioning. Here, we exploit structure in the data to further improve both scalability and performance on the target application. We propose MUSS, a novel method that uses a multilevel approach to relevant and diverse selection. In a recommender system application, our method not only improves precision by up to $4$ percentage points, but is also $20$ to $80$ times faster. Our method is also capable of outperforming baselines on RAG-based question answering accuracy. We present a novel theoretical approach for analyzing this type of problem, and show that our method achieves a constant-factor approximation of the optimal objective. Moreover, our analysis also yields a bound for DGDS that is $2\times$ tighter than the previously known bound.
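As an illustration of the two baselines mentioned above, the sketch below pairs a textbook form of the MMR greedy criterion with a generic two-round scheme over a random partition: each part selects its own candidates independently, and a final pass selects from the pooled candidates. This is written in the spirit of random-partition distributed selection and is not the published DGDS algorithm, nor MUSS itself. The function names, the similarity matrix `sim`, and the relevance vector `rel` are assumptions made for the example.

```python
import numpy as np

def mmr(cand, rel, sim, k, lam):
    """Textbook MMR: trade relevance against max similarity to items already chosen."""
    cand = list(cand)
    selected = []
    while cand and len(selected) < k:
        def score(i):
            # Redundancy is 0 while nothing has been selected yet.
            redundancy = max((sim[i, j] for j in selected), default=0.0)
            return lam * rel[i] - (1.0 - lam) * redundancy
        best = max(cand, key=score)
        selected.append(best)
        cand.remove(best)
    return selected

def partitioned_mmr(rel, sim, k, lam, n_parts, seed=0):
    """Two-round selection over a random partition (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    parts = np.array_split(rng.permutation(len(rel)), n_parts)
    # Round 1: each part is processed independently, so this loop could run in parallel.
    pooled = [i for part in parts for i in mmr(part, rel, sim, k, lam)]
    # Round 2: final selection over the pooled per-part candidates.
    return mmr(pooled, rel, sim, k, lam)
```

The first round touches each part in isolation, which is what makes this family of methods distributable; the second round only sees at most `k * n_parts` pooled candidates.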
