MUSS: Multilevel Subset Selection for Relevance and Diversity

Vu Nguyen, Andrey Kan

TL;DR

This work tackles the NP-hard problem of selecting a small subset that is both highly relevant and diverse. It introduces MUSS, a multilevel distributed method that clusters the data, selects a subset of clusters, runs greedy selection within each selected cluster in parallel, and then applies a final refinement step to produce a size-$k$ subset that optimizes $F(S)=\lambda Q(S)+(1-\lambda) D(S)$. The authors establish a constant-factor approximation framework and prove bounds that tighten the DGDS analysis. They also demonstrate strong empirical gains on both product recommendation and RAG QA tasks, including substantial speedups (up to 80x faster than MMR), and report production deployment for large-scale candidate retrieval. The combination of clustering-based pruning, parallelizable within-cluster selection, and theoretical guarantees yields a scalable solution for large-scale ML systems that require diverse yet relevant recommendations or retrieved items.
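The cluster, select-clusters, select-within-clusters, refine pipeline described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. Several choices here are assumptions that do not come from the summary: $Q(S)$ is taken to be a sum of per-item quality scores, $D(S)$ a sum of pairwise Euclidean distances, each cluster is represented by its centroid and mean quality, and cluster labels are supplied by the caller. The names `greedy_select`, `multilevel_select`, `n_keep`, and `k_per_cluster` are hypothetical; `lam_c` mirrors the cluster-level $\lambda_c$ mentioned in the Figure 5 caption.

```python
import numpy as np

def greedy_select(X, q, k, lam):
    """Greedily pick up to k rows of X, trading quality against distance to picks.

    Assumed marginal gain of item i: lam * q[i] + (1 - lam) * sum of distances
    from i to the items already chosen.
    """
    n = len(X)
    taken = np.zeros(n, dtype=bool)
    dist_sum = np.zeros(n)          # distance from each item to the chosen set
    chosen = []
    for _ in range(min(k, n)):
        gain = lam * q + (1.0 - lam) * dist_sum
        gain[taken] = -np.inf       # never pick the same item twice
        nxt = int(np.argmax(gain))
        chosen.append(nxt)
        taken[nxt] = True
        dist_sum += np.linalg.norm(X - X[nxt], axis=1)
    return chosen

def multilevel_select(X, q, labels, k, n_keep, k_per_cluster, lam, lam_c):
    """Sketch of multilevel selection; labels come from any clustering routine."""
    cluster_ids = np.unique(labels)
    # Level 1: select clusters, each summarized by centroid and mean quality (assumption).
    centroids = np.stack([X[labels == c].mean(axis=0) for c in cluster_ids])
    cluster_q = np.array([q[labels == c].mean() for c in cluster_ids])
    kept = greedy_select(centroids, cluster_q, n_keep, lam_c)
    # Level 2: select within each kept cluster; the iterations are independent,
    # so they could run in parallel.
    pool = []
    for ci in kept:
        idx = np.flatnonzero(labels == cluster_ids[ci])
        local = greedy_select(X[idx], q[idx], k_per_cluster, lam)
        pool.extend(idx[local])
    pool = np.array(pool)
    # Final refinement: one more greedy pass over the pooled candidates.
    final = greedy_select(X[pool], q[pool], k, lam)
    return pool[final].tolist()

# Toy usage with synthetic data; round-robin labels stand in for real clusters.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
q = rng.uniform(size=200)
labels = np.arange(200) % 10
print(multilevel_select(X, q, labels, k=5, n_keep=4, k_per_cluster=5, lam=0.7, lam_c=0.7))
```

Pruning to `n_keep` clusters before any item-level work is what keeps the final greedy pass small: it only sees at most `n_keep * k_per_cluster` candidates instead of the full catalog.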

Abstract

The problem of relevant and diverse subset selection has a wide range of applications, including recommender systems and retrieval-augmented generation (RAG). For example, in recommender systems, one is interested in selecting relevant items while providing a diversified recommendation. The constrained subset selection problem is NP-hard, and popular approaches such as Maximum Marginal Relevance (MMR) are based on greedy selection. Many real-world applications involve large data, but the original MMR work did not consider distributed selection. This limitation was later addressed by a method called DGDS, which allows for a distributed setting using random data partitioning. Here, we exploit structure in the data to further improve both scalability and performance on the target application. We propose MUSS, a novel method that uses a multilevel approach to relevant and diverse selection. In a recommender system application, our method not only improves precision by up to $4$ percentage points, but is also $20$ to $80$ times faster. Our method is also capable of outperforming baselines on RAG-based question answering accuracy. We present a novel theoretical approach for analyzing this type of problem, and show that our method achieves a constant-factor approximation of the optimal objective. Moreover, our analysis also yields a $2\times$ tighter bound for DGDS compared to the previously known bound.
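For readers unfamiliar with the greedy baseline mentioned in the abstract, the following is a minimal sketch of classic MMR selection: at each step it picks the item maximizing $\lambda \cdot \mathrm{sim}(d, \text{query}) - (1-\lambda) \cdot \max_{s \in S} \mathrm{sim}(d, s)$. Note that this is the textbook MMR criterion, which differs in form from the $F(S)$ objective used by MUSS. The use of cosine similarity and the function name `mmr` are assumptions made for illustration.

```python
import numpy as np

def mmr(query, docs, k, lam=0.5):
    """Classic MMR over row vectors in docs, using cosine similarity (assumption)."""
    n = len(docs)
    if k <= 0 or n == 0:
        return []
    docs_n = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_n = query / np.linalg.norm(query)
    relevance = docs_n @ query_n        # similarity of each item to the query
    sim = docs_n @ docs_n.T             # pairwise item-item similarity
    # With nothing selected yet, the redundancy term is empty, so start
    # from the most relevant item.
    selected = [int(np.argmax(relevance))]
    max_sim = sim[selected[0]].copy()   # max similarity to the selected set
    while len(selected) < min(k, n):
        score = lam * relevance - (1.0 - lam) * max_sim
        score[selected] = -np.inf       # exclude items already chosen
        best = int(np.argmax(score))
        selected.append(best)
        max_sim = np.maximum(max_sim, sim[best])
    return selected

# Toy usage: item 1 is a near-duplicate of item 0, item 2 is orthogonal.
docs = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
query = np.array([1.0, 0.0])
print(mmr(query, docs, k=2, lam=0.3))
print(mmr(query, docs, k=2, lam=1.0))
```

Each step scans all remaining items against the selected set, and the selection is inherently sequential, which is the scalability limitation that distributed variants aim to address.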

Paper Structure

This paper contains 33 sections, 9 theorems, 37 equations, 9 figures, 7 tables, 2 algorithms.

Key Result

Lemma 1

Apply Algorithm 1 to select ${\mathcal{S}} = \textsc{alg}1({\mathcal{T}}|k)$. Let ${\bm{t}} \in {\mathcal{T}} \setminus {\mathcal{S}}$. The following inequalities hold

Figures (9)

  • Figure 1: Our MUSS not only achieves better performance on the target task than the baselines, but can also be $20 \times$ to $80 \times$ faster. The inset shows the relative speed improvement against DGDS. Note that MMR is not a distributed method. Here, the task is to select $k$ candidate items for recommendation from catalogs of different sizes, and $k'$ denotes the number of intermediate items to be selected within each cluster for MUSS and DGDS.
  • Figure 2: MUSS performs clustering followed by a multilevel selection. Here, $\bar{{\mathcal{S}}}$ is a set of selected clusters, ${\mathcal{U}}_m$ denotes cluster $m$, and ${\mathcal{S}}_m$ denotes items selected from that cluster.
  • Figure 3: Flow chart of the candidate retrieval module within the real-time ranking framework. The goal is to select, every hour, a subset of $k$ products that are high quality and diverse. This retrieval step is run per category and is non-personalized.
  • Figure 4: t-SNE visualization of selecting $k=100$ items for the "Home" and "Kitchen" datasets. The data forms clusters. Our method performs high-quality and diverse selection, as shown by the red dots. The color scale indicates the quality score of each item.
  • Figure 5: Diversity, quality, and the objective as functions of $\lambda_c$ and $\lambda$ for the Kitchen dataset
  • ...and 4 more figures

Theorems & Definitions (18)

  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Theorem 4
  • Lemma 5
  • Lemma 6
  • Lemma 7
  • Theorem 8
  • proof
  • proof
  • ...and 8 more