Explore and Establish Synergistic Effects Between Weight Pruning and Coreset Selection in Neural Network Training

Weilin Wan, Fan Yi, Weizhong Zhang, Quan Zhou, Cheng Jin

TL;DR

This work tackles the computational burden of training deep neural networks by studying the interaction between weight pruning and coreset selection. It proposes SWaST, a joint optimization framework that alternates pruning and coreset selection and adds a state-preservation mechanism to stabilize training. This avoids the "critical double-loss" instability, in which important weights and the samples that support them are removed at the same time. Experiments across standard benchmarks show strong pruning-coreset synergy, with accuracy gains of up to 17.83% and FLOPs reductions of 10% to 90%, along with improved noise robustness. The approach offers practical benefits for efficient, robust deep learning on resource-constrained platforms.

Abstract

Modern deep neural networks rely heavily on massive model weights and training samples, incurring substantial computational costs. Weight pruning and coreset selection are two emerging paradigms proposed to improve computational efficiency. In this paper, we first explore the interplay between redundant weights and training samples through a transparent analysis: redundant samples, particularly noisy ones, cause model weights to become unnecessarily overtuned to fit them, complicating the identification of irrelevant weights during pruning; conversely, irrelevant weights tend to overfit noisy data, undermining coreset selection effectiveness. To further investigate and harness this interplay in deep learning, we develop a Simultaneous Weight and Sample Tailoring mechanism (SWaST) that alternately performs weight pruning and coreset selection to establish a synergistic effect in training. During this investigation, we observe that when simultaneously removing a large number of weights and samples, a phenomenon we term critical double-loss can occur, where important weights and their supportive samples are mistakenly eliminated at the same time, leading to model instability and nearly irreversible degradation that cannot be recovered in subsequent training. Unlike classic machine learning models, this issue can arise in deep learning due to the lack of theoretical guarantees on the correctness of weight pruning and coreset selection, which explains why these paradigms are often developed independently. We mitigate this by integrating a state preservation mechanism into SWaST, enabling stable joint optimization. Extensive experiments reveal a strong synergy between pruning and coreset selection across varying prune rates and coreset sizes, delivering accuracy boosts of up to 17.83% alongside 10% to 90% FLOPs reductions.
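
The alternation described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the magnitude-based pruning criterion, the lowest-loss selection rule, and the names `magnitude_mask`, `select_coreset`, and `alternating_schedule` are all assumptions introduced here, and the state preservation mechanism is not modeled.

```python
from typing import List


def magnitude_mask(weights: List[float], prune_rate: float) -> List[bool]:
    """Return a keep-mask (True = kept) that drops the smallest-magnitude weights.

    Magnitude pruning is an assumed stand-in criterion, not the paper's method.
    """
    n_prune = int(len(weights) * prune_rate)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:n_prune])
    return [i not in pruned for i in range(len(weights))]


def select_coreset(losses: List[float], fraction: float) -> List[int]:
    """Return sorted indices of the lowest-loss samples.

    Treating high-loss samples as likely noise is an assumed stand-in criterion.
    """
    n_keep = max(1, int(len(losses) * fraction))
    order = sorted(range(len(losses)), key=lambda i: losses[i])
    return sorted(order[:n_keep])


def alternating_schedule(total_epochs: int, interval: int) -> List[str]:
    """Label each epoch (1-based) as 'train', 'prune', or 'select'.

    Every `interval` epochs one tailoring step runs, and the steps alternate
    between pruning and coreset selection (one hypothetical reading of the
    alternation; the actual schedule may differ).
    """
    actions = []
    for epoch in range(1, total_epochs + 1):
        if epoch % interval != 0:
            actions.append("train")
        else:
            round_idx = epoch // interval
            actions.append("prune" if round_idx % 2 == 1 else "select")
    return actions


if __name__ == "__main__":
    print(alternating_schedule(total_epochs=8, interval=2))
    print(magnitude_mask([0.5, -0.1, 0.05, -2.0], prune_rate=0.5))
    print(select_coreset([0.2, 3.0, 0.1, 0.9], fraction=0.5))
```

In this sketch the two tailoring steps never run in the same epoch, which loosely mirrors the motivation for avoiding simultaneous large removals of weights and samples.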

Paper Structure

This paper contains 65 sections, 8 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: The impact of redundant weights and samples on pruning and coreset selection. The collapsed pruned model (yellow curve) in (a) compared to (b) implies that noisy/redundant data increases the difficulty of pruning. (c) shows selection difficulty $\mathcal{I}(\mathcal{D}, \hat{\mathcal{D}})$ rises with polynomial degree, indicating harder coreset selection.
  • Figure 2: Overview of SWaST, alternating between pruning and coreset selection every $\mathcal{R}$ epochs. $\mathcal{D}$ denotes the full dataset, $\hat{\mathcal{D}}$ denotes the selected subset, $\tilde{\mathcal{D}}$ denotes the stored logits, and $T$ denotes the total epochs.
  • Figure 3: Illustration of the “critical double-loss” phenomenon in concurrent weight pruning and sample tailoring: (1) Standard Training shows the close link between Data $A$ and Param $a$. (2) Pruning Only allows recovery of Param $a$ using Data $A$ when mistakenly pruned. (3) Coreset Only supports re-selection of Data $A$ through Param $a$ when excluded in error. (4) Pruning & Coreset causes irreversible degradation from simultaneous exclusion.
  • Figure 4: Efficiency vs. accuracy trade-offs (bubble area $\propto$ training time), where the upper-left region indicates optimal performance (high accuracy with low FLOPs).
  • Figure 5: Results demonstrating the effectiveness of SWaST (lower values are better in (a) and (b), higher values are better in (c)). (a) Noise ratio in selected coresets across training rounds comparing SWaST to the coreset-only baseline, with a 10.62% reduction in the final selection. (b) Overfitting comparison measured by test loss minus validation loss. SWaST significantly reduces overfitting compared to the dense model (Coreset-only), with higher prune rates yielding progressively better overfitting reduction. (c) Loss ratio between noisy and clean samples during training, showing that weight pruning increases the loss on noisy samples while preserving performance on clean data.
  • ...and 2 more figures
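
Two of the diagnostics named in the Figure 5 caption, the noise ratio of a selected coreset (panel a) and the loss ratio between noisy and clean samples (panel c), can be computed as in the sketch below. The function names, the per-sample `is_noisy` flags, and the choice of mean loss as the aggregate are assumptions made for illustration; the caption does not specify how the paper aggregates losses.

```python
from typing import Sequence


def noise_ratio(selected: Sequence[int], is_noisy: Sequence[bool]) -> float:
    """Fraction of the selected sample indices that are flagged as noisy."""
    if not selected:
        return 0.0
    return sum(1 for i in selected if is_noisy[i]) / len(selected)


def noisy_clean_loss_ratio(losses: Sequence[float], is_noisy: Sequence[bool]) -> float:
    """Mean loss on noisy samples divided by mean loss on clean samples.

    Using the mean is an assumption. A larger value means the model fits
    noisy samples less well relative to clean ones.
    """
    noisy = [loss for loss, flag in zip(losses, is_noisy) if flag]
    clean = [loss for loss, flag in zip(losses, is_noisy) if not flag]
    if not noisy or not clean:
        raise ValueError("need at least one noisy and one clean sample")
    return (sum(noisy) / len(noisy)) / (sum(clean) / len(clean))


if __name__ == "__main__":
    flags = [False, True, False, True]  # hypothetical noise labels
    print(noise_ratio([0, 1, 2], flags))
    print(noisy_clean_loss_ratio([1.0, 4.0, 1.0, 2.0], flags))
```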

Theorems & Definitions (3)

  • Remark 1
  • Remark 2
  • Remark 3