On Distributed Larger-Than-Memory Subset Selection With Pairwise Submodular Functions

Maximilian Böther, Abraham Sebastian, Pranjal Awasthi, Ana Klimovic, Srikumar Ramalingam

TL;DR

The paper tackles subset selection for billion-scale datasets, where training on all available data is impractical. It introduces a distributed bounding algorithm that iteratively tightens bounds on each point's minimum and maximum utility, using them to select high-quality points and discard unimportant ones. When bounding alone does not determine the complete subset, a multi-round, partition-based distributed greedy algorithm selects the remainder. Neither stage requires a central machine to hold the final subset. The method optimizes pairwise submodular objectives of the form $f(S)=\alpha\sum_{v\in S}u(v)-\beta\sum_{(v_1,v_2)\in E;\,v_1,v_2\in S}s(v_1,v_2)$, provides guarantees for both exact and approximate bounding, and supports adaptive partitioning in the final selection. Empirically, the approach finds subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods, and it scales to a dataset with 13 billion points. The implementation uses Apache Beam, which removes the centralized memory bottleneck for subset selection in large-scale pretraining.
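To make the objective concrete, the following minimal sketch evaluates $f(S)$ for a small in-memory example. The function name `pairwise_objective`, the dictionary-based inputs, and the default values of `alpha` and `beta` are illustrative assumptions, not the authors' implementation; each undirected edge in $E$ is assumed to be stored once.

```python
from typing import Dict, Hashable, Iterable, Tuple

Point = Hashable


def pairwise_objective(
    subset: Iterable[Point],
    utility: Dict[Point, float],
    similarity: Dict[Tuple[Point, Point], float],
    alpha: float = 0.9,  # illustrative default, not taken from the paper
    beta: float = 0.1,   # illustrative default, not taken from the paper
) -> float:
    """Evaluate f(S) = alpha * sum of u(v) - beta * sum of s(v1, v2) over edges inside S."""
    chosen = set(subset)
    # Unary term: total utility of the selected points.
    unary = sum(utility[v] for v in chosen)
    # Pairwise term: similarity of every edge whose endpoints are both selected.
    pairwise = sum(
        s for (v1, v2), s in similarity.items() if v1 in chosen and v2 in chosen
    )
    return alpha * unary - beta * pairwise


# Toy data: three points and two edges (assumed values for illustration only).
toy_utility = {0: 1.0, 1: 0.5, 2: 0.8}
toy_similarity = {(0, 1): 0.4, (1, 2): 0.2}
score = pairwise_objective({0, 1}, toy_utility, toy_similarity)
```

In this toy example, selecting points 0 and 1 collects their utilities and pays a penalty for the single edge between them, while the edge to point 2 contributes nothing because point 2 is not selected.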

Abstract

Modern datasets span billions of samples, making training on all available data infeasible. Selecting a high-quality subset helps in reducing training costs and enhancing model quality. Submodularity, a discrete analogue of convexity, is commonly used for solving such subset selection problems. However, existing algorithms for optimizing submodular functions are sequential, and the prior distributed methods require at least one central machine to fit the target subset in DRAM. At billion-datapoint scale, even the subset may not fit on a single machine, and the sequential algorithms are prohibitively slow. In this paper, we relax the requirement of having a central machine for the target subset by proposing a novel distributed bounding algorithm with provable approximation guarantees. The algorithm iteratively bounds the minimum and maximum utility values to select high-quality points and discard the unimportant ones. When bounding does not find the complete subset, we use a multi-round, partition-based distributed greedy algorithm to identify the remaining subset. We discuss how to implement these algorithms in a distributed data processing framework and empirically analyze different configurations. We find high-quality subsets on CIFAR-100 and ImageNet with marginal or no loss in quality compared to centralized methods, and scale to a dataset with 13 billion points.

Paper Structure (22 sections, 4 theorems, 19 equations, 15 figures, 4 tables, 6 algorithms)

Key Result

Lemma 4.3

For $v\in V$, if $U_{\text{min}}(v) > U_{\text{max}}^k$, then $v\in S^*$.
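The inclusion rule in Lemma 4.3 can be sketched as a simple filter over precomputed bounds. This sketch assumes that $U_{\text{max}}^k$ denotes the $k$-th largest maximum utility over all points, which is not spelled out in this summary, and it takes the bounds as given dictionaries rather than computing them. The function name `points_forced_into_subset` is hypothetical.

```python
from typing import Dict, Hashable, Set

Point = Hashable


def points_forced_into_subset(
    u_min: Dict[Point, float],
    u_max: Dict[Point, float],
    k: int,
) -> Set[Point]:
    """Return the points v with U_min(v) > U_max^k.

    Assumption: U_max^k is the k-th largest value among the maximum utilities.
    """
    if not 1 <= k <= len(u_max):
        raise ValueError("k must be between 1 and the number of points")
    # k-th largest maximum utility (1-indexed), found by a full sort for clarity.
    kth_largest_max = sorted(u_max.values(), reverse=True)[k - 1]
    # A point whose lower bound beats this threshold is included.
    return {v for v, lower in u_min.items() if lower > kth_largest_max}


# Toy bounds for four points (assumed values for illustration only).
toy_u_min = {"a": 0.9, "b": 0.2, "c": 0.1, "d": 0.0}
toy_u_max = {"a": 1.0, "b": 0.6, "c": 0.5, "d": 0.3}
forced = points_forced_into_subset(toy_u_min, toy_u_max, k=2)
```

With these toy values and $k=2$, the threshold is the second-largest maximum utility, and only point `a` has a minimum utility above it. A full sort is used here only for readability; at the scale the paper targets, the threshold would have to be computed in a distributed fashion.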

Figures (15)

  • Figure 1: Visualization of distributed bounding when finding a 50 % subset for 6 data points.
  • Figure 2: Visualization of the distributed submodular algorithm finding a subset of size 3 out of 10 points using 2 rounds with 3 partitions. The partitioning is given by color, the selected points per partition are marked with a red border, and the numbers represent IDs.
  • Figure 3: Normalized scores for finding a 10 % subset on CIFAR-100, depending on the number of partitions, rounds, and $\alpha$. The full version can be found in \ref{fig:partition-rounds-cifar-full} (CIFAR) and \ref{fig:partition-rounds-imagenet-full} (ImageNet). Here, $100$ denotes the quality of the centralized greedy algorithm.
  • Figure 4: Normalized scores for finding a 10 % subset on CIFAR-100, depending on the number of partitions, rounds, and $\alpha$, using adaptive partitioning. The full version can be found in \ref{fig:adaptive-rounds-cifar-full} (CIFAR) and \ref{fig:adaptive-rounds-imagenet-full} (ImageNet).
  • Figure 5: A rasterized visualization of the chosen 5 000 points out of the 50 000 points in CIFAR-100. The points are colored by label, and chosen data points are depicted as black.
  • ...and 10 more figures

Theorems & Definitions (10)

  • Definition 3.1: Submodularity
  • Definition 3.2: Monotonicity
  • Definition 4.1: Minimum Utility
  • Definition 4.2: Maximum Utility
  • Lemma 4.3
  • Lemma 4.4
  • Definition 4.5: Expected Utility
  • Theorem 4.6
  • proof
  • Lemma 2.1