Projection-Free CNN Pruning via Frank-Wolfe with Momentum: Sparser Models with Less Pretraining
Hamza ElMokhtar Shili, Natasha Patnaik, Isabelle Ruble, Kathryn Jarjoura, Daniel Suarez Aguirre
TL;DR
The paper investigates projection-free CNN pruning with Frank-Wolfe (FW) variants, including a momentum-enhanced variant, to identify sparse subnetworks after minimal pretraining. Pruning is framed as a constrained optimization problem and evaluated on MNIST. FW with momentum yields sparser, more accurate networks than simple magnitude-based pruning and standard FW, with only modest inference-time overhead. A key finding is that strong accuracy is reached after only 1–2 epochs of dense pretraining, suggesting that full pretraining may be unnecessary in resource-constrained settings. The work positions FW-based pruning as a practical tool for efficient CNN compression and points toward future work on broader generalization and structured pruning.
Abstract
We investigate algorithmic variants of the Frank-Wolfe (FW) optimization method for pruning convolutional neural networks (CNNs). This is motivated by the "Lottery Ticket Hypothesis", which suggests the existence of smaller subnetworks within larger pretrained networks that perform comparably well, if not better. While most literature in this area focuses on deep neural networks more generally, we specifically consider CNNs for image classification tasks. Building on the hypothesis, we compare simple magnitude-based pruning, an FW-style pruning scheme, and an FW method with momentum on a CNN trained on MNIST. Our experiments track test accuracy, loss, sparsity, and inference time as we vary the dense pretraining budget from 1 to 10 epochs. We find that FW with momentum yields pruned networks that are both sparser and more accurate than the original dense model and the simple pruning baselines, while incurring minimal inference-time overhead in our implementation. Moreover, FW with momentum reaches these accuracies after only a few epochs of pretraining, indicating that full pretraining of the dense model is not required in this setting.
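To make the idea of projection-free, sparsity-inducing optimization concrete, the following is a minimal sketch of a Frank-Wolfe loop with momentum. It is not the authors' implementation. Several choices are assumptions made purely for illustration: the constraint set is an L1 ball, the momentum is an exponential moving average of gradients with coefficient `beta`, the step size is the classical `2 / (t + 2)` schedule, and the objective is a toy least-squares problem rather than a CNN loss. All function and variable names are hypothetical.

```python
import numpy as np

def l1_ball_lmo(direction, radius):
    """Linear minimization oracle: argmin of <direction, s> over ||s||_1 <= radius.

    The minimizer is a vertex of the L1 ball, i.e. a vector with a single
    nonzero entry, which is why no projection step is needed.
    """
    s = np.zeros_like(direction)
    i = int(np.argmax(np.abs(direction)))
    s[i] = -radius * np.sign(direction[i])
    return s

def frank_wolfe_momentum(grad_fn, w0, radius, steps=1000, beta=0.9):
    """Frank-Wolfe where the LMO is queried with an averaged gradient (assumed form)."""
    w = w0.astype(float).copy()
    m = np.zeros_like(w)                           # momentum buffer
    for t in range(steps):
        m = beta * m + (1.0 - beta) * grad_fn(w)   # EMA of gradients
        s = l1_ball_lmo(m, radius)                 # sparse vertex of the feasible set
        gamma = 2.0 / (t + 2.0)                    # classical FW step size
        w = (1.0 - gamma) * w + gamma * s          # convex combination stays feasible
    return w

# Toy least-squares problem with a sparse ground truth (illustrative only).
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[[0, 5, 10]] = [1.0, -0.5, 0.5]
b = A @ w_true

def loss(w):
    residual = A @ w - b
    return 0.5 * np.mean(residual ** 2)

def grad(w):
    return A.T @ (A @ w - b) / len(b)

radius = np.abs(w_true).sum()
w0 = np.zeros(20)
w_fw = frank_wolfe_momentum(grad, w0, radius)

print("loss at start:", loss(w0))
print("loss after FW with momentum:", loss(w_fw))
print("L1 norm of solution:", np.abs(w_fw).sum())
```

Because every update is a convex combination of the current iterate and a single-coordinate vertex, the iterate never leaves the L1 ball and no projection is required. In a pruning setting, `grad_fn` would presumably be replaced by a minibatch gradient of the network's training loss, and small-magnitude weights in the result would be removed to obtain the sparse model.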
