COLT: Cyclic Overlapping Lottery Tickets for Faster Pruning of Convolutional Neural Networks

Md. Ismail Hossain, Mohammed Rakib, M. M. Lutfe Elahi, Nabeel Mohammed, Shafin Rahman

TL;DR

COLT introduces Cyclic Overlapping Lottery Tickets, a data-partitioned pruning framework that uses overlapping masks to derive highly sparse subnetworks in fewer pruning rounds while preserving accuracy. By training multiple models on non-overlapping class partitions and intersecting their pruned weights, COLT yields robust, transferable tickets that generalize across datasets and extend to object detection. Across CIFAR-10/100, Tiny ImageNet, ImageNet, and Pascal VOC, COLT achieves comparable performance to LTH at high sparsity but with significantly lower computation time and improved transferability from large to small datasets. This approach reduces training cost and energy while delivering practical, scalable sparse networks for CNNs.

Abstract

Pruning refers to the elimination of trivial weights from neural networks. The sub-networks within an overparameterized model produced after pruning are often called lottery tickets. This research aims to generate, from a set of lottery tickets, winning tickets that can achieve accuracy similar to that of the original unpruned network. We introduce a novel winning ticket called the Cyclic Overlapping Lottery Ticket (COLT), obtained by data splitting and cyclic retraining of the pruned network from scratch. We apply a cyclic pruning algorithm that keeps only the overlapping weights of different pruned models trained on different data segments. Our results demonstrate that COLT can achieve accuracy similar to that of the unpruned model while maintaining high sparsity. We show that the accuracy of COLT is on par with, and at times better than, that of the winning tickets of the Lottery Ticket Hypothesis (LTH). Moreover, COLTs can be generated in fewer iterations than tickets generated by the popular Iterative Magnitude Pruning (IMP) method. In addition, we observe that COLTs generated on large datasets can be transferred to small ones without compromising performance, demonstrating their generalization capability. We conduct all our experiments on the CIFAR-10, CIFAR-100, and Tiny ImageNet datasets and report performance superior to state-of-the-art methods.
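
The claim that overlapping tickets need fewer pruning rounds than IMP can be illustrated with simple arithmetic. The sketch below is not taken from the paper: it assumes a hypothetical per-round pruning rate `p = 0.2` and, as an idealization, that the two per-partition masks are statistically independent, so their intersection keeps a fraction `(1 - p) ** 2` of the surviving weights each round. Real masks are correlated, so the kept fraction per round lies somewhere between `1 - 2p` and `1 - p`, and the actual saving is smaller than this upper-end estimate. The function name `rounds_to_reach` is illustrative.

```python
def rounds_to_reach(target_sparsity: float, keep_per_round: float) -> int:
    """Smallest number of rounds k such that 1 - keep_per_round**k >= target_sparsity."""
    if not 0.0 < keep_per_round < 1.0:
        raise ValueError("keep_per_round must be strictly between 0 and 1")
    if not 0.0 <= target_sparsity < 1.0:
        raise ValueError("target_sparsity must be in [0, 1)")
    remaining = 1.0  # fraction of weights still unpruned
    rounds = 0
    while 1.0 - remaining < target_sparsity:
        remaining *= keep_per_round
        rounds += 1
    return rounds


p = 0.2  # hypothetical per-round pruning rate (assumption, not from the paper)

# IMP-style: a single mask keeps (1 - p) of the surviving weights per round.
imp_rounds = rounds_to_reach(0.9, 1 - p)

# Overlap-style: intersecting two masks, assumed independent (idealization),
# keeps (1 - p)**2 of the surviving weights per round.
overlap_rounds = rounds_to_reach(0.9, (1 - p) ** 2)

print(imp_rounds, overlap_rounds)
```

Under these assumptions the single-mask schedule needs 11 rounds to pass 90% sparsity and the overlap schedule needs 6. These are illustrative figures only and should not be read as a reproduction of the reported results.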

Paper Structure (17 sections, 2 equations, 8 figures, 7 tables, 1 algorithm)

Figures (8)

  • Figure 1: We address the problem of pruning an overparameterized network without compromising performance. Here, we visualize the initial weights of learned networks while training with TinyImageNet. (a) Weights of the unpruned overparameterized network (no pruning), achieving an accuracy of 72.7%. (b) The popular pruning method LTH can prune the same network to a sparsity of 88.8% (green grid regions) in 10 pruning rounds, achieving an accuracy of 71.42%. (c) In this paper, we propose a novel pruning method named COLT that can prune the network in (a) to a sparsity of 89.1% in 7 pruning rounds and achieve an accuracy of 72.8%. Compared to LTH in (b), COLT can generate highly sparse winning subnetworks (tickets) in fewer iterations (7 vs. 10) while maintaining similar accuracy.
  • Figure 2: Illustration of the advantage of our proposed COLT over the LTH-based pruning method using the ResNet-18 architecture on the (Left) CIFAR-10 and (Right) CIFAR-100 datasets. Compared to LTH, COLT can generate a highly sparse ticket in fewer rounds/iterations. We notice the same trend across different sparsity ranges, e.g., 0-70%, 0-90%, and 0-98%. In all cases, COLT maintains accuracy similar to LTH while requiring fewer pruning rounds. Specifically, for (Left) ResNet-18 on CIFAR-10, sparsities of 61.5%, 88.8%, and 97.6% yield accuracies of 60.28%, 59.01%, and 51.81%, respectively. For (Right) ResNet-18 on CIFAR-100, sparsities of 61.7%, 89.1%, and 97.4% yield accuracies of about 73.59%, 72.88%, and 68.53%, respectively.
  • Figure 3: Transferability of tickets calculated from partition 1, $m^{(1)}$, partition 2, $m^{(2)}$, and their overlap, $m = m^{(1)} \cap m^{(2)}$. The overlapping ticket $m$ achieves a higher pruning rate (sparsity) while maintaining accuracy similar to the others. $m^{(1)}$ and $m^{(2)}$ reach matching sparsity because we prune a fixed $p\%$ of the low-magnitude weights in every pruning round. In later pruning rounds, all tickets behave similarly because the same $m$, calculated in the current round, is transferred to both $\mathcal{F}_1$ and $\mathcal{F}_2$ for the next round, making $\theta^{(1)}$ and $\theta^{(2)}$ similar.
  • Figure 4: A toy example. Initial random weights are updated after training. Using the updated weights, we calculate $m^{(1)}$ and $m^{(2)}$ by assigning zero to the locations of the $p\%$ lowest-magnitude weights (bold values). Then, we calculate a generic ticket, $m = m^{(1)} \cap m^{(2)}$, which defines the pruned subnetwork. The pruned initial weights for the subsequent pruning round are calculated as $m\odot\theta^{(1)}$ and $m\odot\theta^{(2)}$.
  • Figure 5: Visual illustration of COLT generation for $N = 2$. (a) Two models of identical architecture, $\mathcal{F}_1$ and $\mathcal{F}_2$, are first initialized with the same initial weights (blue). (b) Next, the two models are trained using the same set of hyperparameters to obtain the final weights (red and green). (c) After training, the $p\%$ lowest-magnitude final weights are pruned from each model (red weights are the $p\%$ lowest-magnitude weights, while green ones are above the threshold). (d) The models are then rewound to their initial weights (blue), with the pruned weights set to zero. (e) After rewinding, the models' overlapping weights (blue) are kept, and the rest are pruned. The result is a COLT ticket that can reach accuracy similar to or better than that of the original network.
  • ...and 3 more figures
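
The single pruning round described in the captions of Figures 4 and 5 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper name `magnitude_mask`, the 3x3 weight values, and the pruning fraction of one third are all hypothetical, and the two "trained" arrays stand in for weights that would come from training $\mathcal{F}_1$ and $\mathcal{F}_2$ on their respective data partitions. Following Figure 5(a), both models are assumed to share one initial weight array.

```python
import numpy as np


def magnitude_mask(trained, alive, prune_fraction):
    """Return a boolean mask that additionally removes the `prune_fraction`
    lowest-magnitude weights among the entries still marked alive."""
    surviving = np.flatnonzero(alive)  # flat indices of unpruned weights
    n_prune = int(round(prune_fraction * surviving.size))
    # Sort surviving weights by absolute value, smallest first.
    order = np.argsort(np.abs(trained.ravel()[surviving]), kind="stable")
    new_mask = alive.ravel().copy()
    new_mask[surviving[order[:n_prune]]] = False
    return new_mask.reshape(alive.shape)


# Hypothetical shared initial weights (Figure 5a) and post-training weights.
theta_init = np.array([[0.5, -0.3, 0.8], [0.1, -0.9, 0.4], [-0.6, 0.2, 0.7]])
trained_1 = np.array([[0.9, -0.1, 1.2], [0.05, -1.1, 0.6], [-0.8, 0.3, 1.0]])
trained_2 = np.array([[0.7, -0.4, 0.02], [0.15, -1.3, 0.5], [-0.9, 0.1, 1.1]])

alive = np.ones_like(theta_init, dtype=bool)  # nothing pruned before round 1

m1 = magnitude_mask(trained_1, alive, 1 / 3)  # ticket from partition 1
m2 = magnitude_mask(trained_2, alive, 1 / 3)  # ticket from partition 2
m = m1 & m2                                   # overlap: m = m1 ∩ m2

# Rewind: apply the overlapping mask to the initial weights for the next round.
pruned_init = m * theta_init

print("kept by m1, m2, m:", int(m1.sum()), int(m2.sum()), int(m.sum()))
```

In this toy case each individual mask keeps 6 of 9 weights while the overlap keeps 5, which mirrors the caption's point that $m$ is sparser than either $m^{(1)}$ or $m^{(2)}$ after the same round. Passing `m` back in as `alive` would start the next round from the overlapping ticket.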