The Best of Both Worlds: Bridging Quality and Diversity in Data Selection with Bipartite Graph

Minghao Wu, Thuy-Trang Vu, Lizhen Qu, Gholamreza Haffari

TL;DR

This work addresses the challenge of selecting high-quality, diverse data for supervised fine-tuning of large language models by framing data selection as a set cover problem and introducing GraphFilter. GraphFilter builds a bipartite graph between sentences and their $n$-grams, and uses a multiplicative priority $\phi(u)=\textsc{Quality}(u)\times\textsc{Diversity}(u)$ to greedily select a subset of size $k$, balancing informativeness (via SuperFilter) with $n$-gram coverage (via TF-IDF across $n\in\{1,2,3\}$). The method relates to the classical set cover problem and, with a max-heap implementation, achieves $O(\log N)$ per iteration, enabling CPU-efficient runs. Empirically, GraphFilter outperforms nine baselines across three backbones and six benchmarks, while also reducing computation time and highlighting the importance of instruction diversity. Analyses reveal that combining multiple $n$-gram levels and balancing quality and diversity yields robust improvements across budgets, with trigram features providing an optimal trade-off between performance and efficiency.

Abstract

The performance of large language models (LLMs) is strongly influenced by the quality and diversity of data used during supervised fine-tuning (SFT). However, current data selection methods often prioritize one aspect over the other, resulting in suboptimal training outcomes. To address this, we formulate data selection as a set cover problem and present GraphFilter, a novel approach that balances both quality and diversity in data selection. GraphFilter models the dataset as a bipartite graph connecting sentences to their constituent n-grams, then employs a priority function that combines quality and diversity metrics multiplicatively. GraphFilter iteratively selects sentences with the highest priority, removes covered n-grams from the bipartite graph, and recomputes priorities to reflect the changing data landscape. We validate GraphFilter using three model backbones across six widely-used benchmarks, demonstrating that it outperforms nine existing baselines in both model performance and computational efficiency. Further analysis shows that our design choices lead to more effective subset selection, underscores the value of instruction diversity, and provides insights into how quality and diversity interact with different subset sizes.
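
The selection loop described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `ngrams` and `greedy_select` are hypothetical, quality scores are passed in as plain floats rather than computed by SuperFilter, n-gram weights default to 1.0 instead of TF-IDF, and the heap is refreshed lazily (an entry is re-scored only when it reaches the top), which is one common way to keep heap operations logarithmic and is assumed here rather than taken from the paper.

```python
import heapq


def ngrams(tokens, max_n=3):
    """Return the set of all n-grams of the token list for n = 1..max_n."""
    return {
        tuple(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }


def greedy_select(sentences, quality, k, weight=None, max_n=3):
    """Pick up to k sentence indices by priority = quality * diversity.

    Diversity is the summed weight of the sentence's n-grams that no
    previously selected sentence has covered (an assumption of this sketch).
    """
    # Bipartite graph: each sentence node maps to its set of n-gram nodes.
    edges = [ngrams(s.split(), max_n) for s in sentences]
    weight = weight or (lambda g: 1.0)  # assumption: uniform n-gram weights
    covered = set()

    def priority(u):
        diversity = sum(weight(g) for g in edges[u] if g not in covered)
        return quality[u] * diversity

    # heapq is a min-heap, so priorities are stored negated.
    heap = [(-priority(u), u) for u in range(len(sentences))]
    heapq.heapify(heap)

    selected = []
    while heap and len(selected) < k:
        stored, u = heapq.heappop(heap)
        current = -priority(u)
        if current != stored:
            # The priority shrank since this entry was pushed: reinsert it.
            heapq.heappush(heap, (current, u))
            continue
        selected.append(u)
        covered |= edges[u]  # "remove" the n-grams this sentence covers
    return selected


sentences = ["a b c d", "a b c", "x y", "a b"]
print(greedy_select(sentences, [1.0, 1.0, 1.0, 1.0], k=2))
```

With uniform quality and uniform weights, the priority reduces to the number of still-uncovered n-grams attached to a sentence, which matches the degree-based illustration in the Figure 1 caption. Lazy re-scoring is valid here only because priorities can never increase as n-grams are covered, so a stored value is always an upper bound on the current one.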

Paper Structure

This paper contains 40 sections, 4 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: An example of a single iteration of GraphFilter without the priority function. In this case, the degree of a sentence node serves as the priority score. Sentence nodes are in blue and n-gram nodes in green. The selected sentence node is yellow, while connected n-gram nodes are red. Removed n-gram nodes are white, with removed edges as dashed lines. Node $u_1$ is selected in the current iteration, and $u_4$ will be the next.
  • Figure 2: One panel displays the quality-diversity relationships of subsets selected by different methods, with $\uparrow$ indicating a preference for higher values. Another shows the semantic diversity in a t-SNE plot of subsets from GraphFilter and ArmoRM, where green rectangles indicate data points chosen by GraphFilter but not by ArmoRM. A third depicts the semantic diversity in a t-SNE plot comparing subsets from GraphFilter and InsTag.
  • Figure 3: Performance gap ($\Delta_{\textrm{all}}$) with respect to $\mu_{\textsc{all}}$, comparing SuperFilter, InsTag, and GraphFilter against Random, across various data selection budgets.
  • Figure 4: The prompt used for AlpaGasus annotation.