
D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning

Adyasha Maharana, Prateek Yadav, Mohit Bansal

TL;DR

D2 Pruning is a graph-based message-passing framework that selects coresets by jointly balancing example difficulty and data diversity. The method constructs a sparse dataset graph, updates each node's difficulty score through forward message passing, and promotes diversity through reverse message passing during iterative sampling. It improves over previous state-of-the-art methods at low-to-medium pruning rates (up to 70%) on vision and NLP benchmarks, and it extends to self-supervised and unsupervised data curation, including filtering of large multimodal datasets. Because factors other than difficulty and diversity can be plugged into the same framework, it opens avenues for richer, more robust data pruning strategies.

Abstract

Analytical theories suggest that higher-quality data can lead to lower test errors in models trained on a fixed data budget. Moreover, a model can be trained on a lower compute budget without compromising performance if a dataset can be stripped of its redundancies. Coreset selection (or data pruning) seeks to select a subset of the training data, referred to as a coreset, so as to maximize the performance of models trained on this subset. There are two dominant approaches: (1) geometry-based data selection for maximizing data diversity in the coreset, and (2) functions that assign difficulty scores to samples based on training dynamics. Optimizing for data diversity leads to a coreset that is biased towards easier samples, whereas selection by difficulty ranking omits easy samples that are necessary for the training of deep learning models. This demonstrates that data diversity and importance scores are two complementary factors that need to be jointly considered during coreset selection. We represent a dataset as an undirected graph and propose a novel pruning algorithm, D2 Pruning, that uses forward and reverse message passing over this dataset graph for coreset selection. D2 Pruning updates the difficulty score of each example by incorporating the difficulty of its neighboring examples in the dataset graph. Then, these updated difficulty scores direct a graph-based sampling method to select a coreset that encapsulates both diverse and difficult regions of the dataset space. We evaluate supervised and self-supervised versions of our method on various vision and language datasets. Results show that D2 Pruning improves coreset selection over previous state-of-the-art methods at pruning rates of up to 70%. Additionally, we find that using D2 Pruning for filtering large multimodal datasets leads to increased diversity in the dataset and improved generalization of pretrained models.
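
The forward and reverse message passing described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not give the update rules, so the Gaussian edge kernel, the directed k-nearest-neighbor approximation of the undirected graph, the single round of forward passing, and the names `d2_prune_sketch`, `gamma_f`, and `gamma_r` are all assumptions made for illustration.

```python
import numpy as np


def d2_prune_sketch(embeddings, difficulty, budget, k=3, gamma_f=1.0, gamma_r=1.0):
    """Illustrative sketch only; kernel and update rules are assumptions."""
    x = np.asarray(embeddings, dtype=float)
    s = np.asarray(difficulty, dtype=float)
    n = x.shape[0]

    # Pairwise squared Euclidean distances (O(n^2) memory, toy scale only).
    diff = x[:, None, :] - x[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(sq_dist, np.inf)  # a node is never its own neighbor

    # Sparse graph: each node keeps edges to its k nearest neighbors.
    nbrs = np.argsort(sq_dist, axis=1)[:, :k]
    nbr_sq = np.take_along_axis(sq_dist, nbrs, axis=1)

    # Forward message passing (assumed form): each node adds the
    # distance-weighted difficulty of its neighbors to its own score.
    scores = s + np.sum(np.exp(-gamma_f * nbr_sq) * s[nbrs], axis=1)

    selected = []
    available = np.ones(n, dtype=bool)
    for _ in range(min(budget, n)):
        # Pick the highest-scoring node that has not been selected yet.
        i = int(np.argmax(np.where(available, scores, -np.inf)))
        selected.append(i)
        available[i] = False
        # Reverse message passing (assumed form): down-weight the neighbors
        # of the selected node so the next pick comes from another region.
        scores[nbrs[i]] -= np.exp(-gamma_r * nbr_sq[i]) * scores[i]
    return selected
```

On two tight clusters where one cluster is uniformly harder than the other, ranking by difficulty alone would fill a budget of two from the hard cluster, whereas this sketch takes the hardest point of each cluster, because selecting a node suppresses the scores of its neighbors.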

Paper Structure (43 sections, 5 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Overview of $\mathbb{D}^2$Pruning. (left) Our proposed algorithm contains three steps: (a) Initialization of graph $\mathcal{G}$ using difficulty scores and edge weights based on embedding distance, (b) message passing between connected nodes to propagate difficulty scores of neighboring samples, and (c) data selection and reverse message passing to avoid sampling from the same neighborhood. (right) $\mathbb{D}^2$Pruning selects a balanced subset of samples (red) from sparse and dense regions.
  • Figure 2: Sampling Methods. Demonstration of data distribution (left) and importance scores (right) in (a) a single class in the CIFAR10 dataset, and coresets selected under 90% pruning rate via (b) random sampling, (c) greedy $k$-center selection that maximizes data diversity, (d) moderate coreset (Xia et al., 2023), (e) graph-based density sampling using embedding distance (Ebert et al., 2012), and (f) our method, $\mathbb{D}^2$Pruning, designed to balance data diversity and difficulty during coreset selection. Embeddings are extracted from a ResNet18 model trained on CIFAR10.
  • Figure 3: Effect of $k$ and $\gamma_{r}$. (A) Accuracy at 30% and 90% pruning of CIFAR100 for different numbers of nearest neighbors ($k$) and values of the message passing weight $\gamma_{r}$; distribution of difficulty scores in the best coresets selected via $\mathbb{D}^2$Pruning for 30% (center) and 70% (right) pruning of (B) CIFAR100 and (C) ImageNet-1K.
  • Figure 4: Results of self-supervised pruning methods on ImageNet-1K. $\mathbb{D}^2$Pruning performs as well as the best supervised pruning method at 30% pruning rate and significantly improves over other self-supervised methods.
  • Figure 5: Example of coresets selected by $\mathbb{D}^2$Pruning from ImageNet-1K at 30% pruning rate. Image sub-populations are extracted from ImageNet-1K by a recursive traversal of the connectivity graph $\mathcal{G}$ initialized for $\mathbb{D}^2$Pruning. For each sub-population, we show the images retained in the coreset with ✓ and the images left out of the coreset with X.
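
Figure 2 lists greedy $k$-center selection as the diversity-maximizing baseline. For readers unfamiliar with it, the following is a minimal sketch of the standard farthest-first heuristic, assuming Euclidean distance on the embeddings. The function name `greedy_k_center` and the `first` seed argument are illustrative choices, not taken from the paper.

```python
import numpy as np


def greedy_k_center(embeddings, budget, first=0):
    """Farthest-first traversal: repeatedly add the point farthest from the current set."""
    x = np.asarray(embeddings, dtype=float)
    selected = [first]
    # Distance from every point to its nearest selected center so far.
    min_dist = np.linalg.norm(x - x[first], axis=1)
    while len(selected) < min(budget, len(x)):
        i = int(np.argmax(min_dist))  # the point least covered by current centers
        selected.append(i)
        # Update coverage with the newly added center.
        min_dist = np.minimum(min_dist, np.linalg.norm(x - x[i], axis=1))
    return selected
```

Because this procedure looks only at geometry, it ignores difficulty scores entirely, which is the limitation the abstract attributes to diversity-only selection.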