GIST: Greedy Independent Set Thresholding for Max-Min Diversification with Submodular Utility

Matthew Fahrbach; Srikumar Ramalingam; Morteza Zadimoghaddam; Sara Ahmadian; Gui Citovsky; Giulia DeSalvo

TL;DR

The paper introduces MDMS, a subset selection problem that combines a monotone submodular utility with a max-min diversification term: maximize $f(S)=g(S)+\lambda\cdot\text{div}(S)$ subject to a cardinality constraint $|S| \le k$. It proposes the GIST algorithm, which achieves a $\tfrac{1}{2}-\varepsilon$ approximation by sweeping distance thresholds and, for each threshold, running a bicriteria greedy approximation to a maximum-weight independent-set problem on the corresponding intersection graph; for linear utilities the guarantee improves to $\tfrac{2}{3}-\varepsilon$. On the hardness side, the paper shows that approximating MDMS within a factor of $0.5584$ is NP-hard in general metrics and that the Euclidean case with linear utility is APX-complete. It also presents a simple warm-up $0.387$-approximation algorithm and analyzes the linear-utility setting in depth, with matching hardness results. Empirically, GIST outperforms state-of-the-art baselines on synthetic tasks and improves single-shot data sampling on ImageNet, which supports its use for data summarization and training-set curation.

Abstract

This work studies a novel subset selection problem called max-min diversification with monotone submodular utility ($\textsf{MDMS}$), which has a wide range of applications in machine learning, e.g., data sampling and feature selection. Given a set of points in a metric space, the goal of $\textsf{MDMS}$ is to maximize $f(S) = g(S) + \lambda \cdot \texttt{div}(S)$ subject to a cardinality constraint $|S| \le k$, where $g(S)$ is a monotone submodular function and $\texttt{div}(S) = \min_{u,v \in S : u \ne v} \text{dist}(u,v)$ is the max-min diversity objective. We propose the $\texttt{GIST}$ algorithm, which gives a $\frac{1}{2}$-approximation guarantee for $\textsf{MDMS}$ by approximating a series of maximum independent set problems with a bicriteria greedy algorithm. We also prove that it is NP-hard to approximate within a factor of $0.5584$. Finally, we show in our empirical study that $\texttt{GIST}$ outperforms state-of-the-art benchmarks for a single-shot data sampling task on ImageNet.
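The objective in the abstract can be made concrete with a small sketch. Everything below is illustrative rather than taken from the paper: the coverage utility `coverage`, the toy `points` and `covers` data, and the convention that sets with fewer than two points have diversity 0 are all assumptions made for the example.

```python
import math
from itertools import combinations


def div(S, dist):
    """Max-min diversity: the smallest pairwise distance within S.

    Assumed convention (not from the source): fewer than two points -> 0.0.
    """
    if len(S) < 2:
        return 0.0
    return min(dist(u, v) for u, v in combinations(S, 2))


def coverage(S, covers):
    """Hypothetical monotone submodular utility: number of items covered by S."""
    covered = set()
    for v in S:
        covered |= covers[v]
    return float(len(covered))


def mdms_objective(S, g, dist, lam):
    """f(S) = g(S) + lam * div(S)."""
    return g(S) + lam * div(S, dist)


# Toy instance (hypothetical data): three points in the plane.
points = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (0.0, 1.0)}
covers = {"a": {1, 2}, "b": {2, 3}, "c": {1}}


def euclid(u, v):
    return math.dist(points[u], points[v])


def g(S):
    return coverage(S, covers)


value_ab = mdms_objective(["a", "b"], g, euclid, lam=0.5)        # 3 + 0.5 * 5
value_abc = mdms_objective(["a", "b", "c"], g, euclid, lam=0.5)  # 3 + 0.5 * 1
```

On this toy instance, adding `"c"` leaves the utility unchanged but shrinks the minimum pairwise distance from 5 to 1, so the objective drops. This illustrates that $f$ is not monotone even though $g$ is, since $\texttt{div}$ can only decrease as points are added.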

Paper Structure (41 sections, 14 theorems, 50 equations, 3 figures, 1 table)

This paper contains 41 sections, 14 theorems, 50 equations, 3 figures, 1 table.

Introduction
Our contributions
Related work
Submodular maximization.
Diversity maximization.
Data sampling.
Preliminaries
Submodular function.
Max-min diversity.
MDMS problem statement.
Intersection graph.
Warm-up: Simple $0.387$-approximation algorithm
Algorithm
Linear utility
Hardness of approximation
...and 26 more sections

Key Result

Theorem 3.1

For any $\varepsilon > 0$, GIST outputs a set $S \subseteq V$ with $|S| \le k$ and $f(S) \ge (1/2 - \varepsilon) \cdot \textnormal{OPT}$ using $O(nk \log_{1+\varepsilon}(1/\varepsilon))$ submodular value oracle queries.
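To give a feel for the kind of procedure the theorem refers to, here is a minimal sketch of threshold-swept greedy selection, based only on the description in the TL;DR and abstract. It is not the paper's pseudocode and carries no approximation guarantee: the starting threshold `eps * d_max / 2`, the geometric step `1 + eps`, the helper names, and the toy data are assumptions, and any additional candidate solutions the paper may consider are omitted.

```python
import math
from itertools import combinations


def div(S, dist):
    """Smallest pairwise distance in S (assumed 0.0 for fewer than two points)."""
    if len(S) < 2:
        return 0.0
    return min(dist(u, v) for u, v in combinations(S, 2))


def greedy_independent_set(V, g, dist, k, d):
    """Greedily add the point with the largest marginal gain in g among
    points at distance >= d from every point chosen so far."""
    S = []
    while len(S) < k:
        candidates = [v for v in V
                      if v not in S and all(dist(v, u) >= d for u in S)]
        if not candidates:
            break
        base = g(S)
        S.append(max(candidates, key=lambda v: g(S + [v]) - base))
    return S


def gist_sketch(V, g, dist, k, lam, eps):
    """Sweep a geometric grid of distance thresholds and keep the best set."""
    if eps <= 0:
        raise ValueError("eps must be positive")

    def f(S):
        return g(S) + lam * div(S, dist)

    best = greedy_independent_set(V, g, dist, k, 0.0)  # plain greedy on g
    d_max = max((dist(u, v) for u, v in combinations(V, 2)), default=0.0)
    d = eps * d_max / 2  # assumed smallest threshold
    while d_max > 0 and d <= d_max:
        T = greedy_independent_set(V, g, dist, k, d)
        if f(T) > f(best):
            best = T
        d *= 1 + eps
    return best


# Toy instance (hypothetical): two tight clusters, linear utility via weights.
points = {"a": (0.0, 0.0), "b": (0.0, 1.0), "c": (10.0, 0.0), "d": (10.0, 1.0)}
weights = {"a": 5.0, "b": 4.0, "c": 2.0, "d": 1.0}


def euclid(u, v):
    return math.dist(points[u], points[v])


def g(S):
    return sum(weights[v] for v in S)


plain = greedy_independent_set(list(points), g, euclid, 2, 0.0)
chosen = gist_sketch(list(points), g, euclid, k=2, lam=1.0, eps=0.5)
```

On this toy instance, plain greedy picks the two heaviest points, which sit in the same cluster, while the swept thresholds force the second pick into the far cluster and yield a higher value of $f$. Each call to `greedy_independent_set` runs at most $k$ rounds over at most $n$ candidates, and the number of thresholds grows like $\log_{1+\varepsilon}(1/\varepsilon)$, which is consistent with the shape of the query bound in the theorem.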

Figures (3)

Figure 1: $f(S_{\text{ALG}})$ for baseline methods and GIST, for each cardinality constraint $k \in [n]$, on synthetic data with $n = 1000$, $\alpha = 0.95$, and $\beta = 0.75$.
Figure 2: Baseline comparison with $\alpha \in \{0.85, 0.90, 0.95, 1.00\}$ and $\beta = 0.75$.
Figure 3: Baseline comparison with $\alpha = 0.95$ and $\beta \in \{0.60, 0.70, 0.80, 0.90\}$.

Theorems & Definitions (24)

Remark 2.1
Theorem 3.1
Lemma 3.2
proof
proof : Proof of \ref{thm:submod_main_approx_theorem}
Theorem 3.3
Theorem 4.1
proof : Proof sketch
Theorem 4.2
Lemma 4.3: alimonti2000some
...and 14 more
