Comb, Prune, Distill: Towards Unified Pruning for Vision Model Compression
Jonas Schmitt, Ruiping Liu, Junwei Zheng, Jiaming Zhang, Rainer Stiefelhagen
TL;DR
CPD addresses the need for resource-efficient vision models by proposing a model- and task-agnostic pruning framework that unifies pruning with knowledge distillation. It introduces a Combing step to automatically resolve layer dependencies, a Hessian-based Pruning pipeline to select and remove channels, and a Distillation step to transfer knowledge from the full model to the pruned one. Empirically, CPD yields substantial speedups in classification (up to 4.31×) and notable latency reductions in segmentation (≈48% and 26%) with modest accuracy or mIoU losses, across CNNs and Vision Transformers on ImageNet and ADE20K. The work demonstrates broad generalization and practical impact for deploying compact models in resource-constrained settings, such as intelligent transportation systems and robotics, while outlining directions to extend the framework to other architectures and tasks.
Abstract
Lightweight and effective models are essential for devices with limited resources, such as intelligent vehicles. Structured pruning offers a promising approach to model compression and efficiency enhancement. However, existing methods often tie pruning techniques to specific model architectures or vision tasks. To address this limitation, we propose a novel unified pruning framework, Comb, Prune, Distill (CPD), which addresses both model-agnostic and task-agnostic concerns simultaneously. Our framework employs a combing step to resolve hierarchical layer-wise dependency issues, enabling architecture independence. Additionally, the pruning pipeline adaptively removes parameters based on importance scoring metrics, regardless of the vision task. To help the model retain its learned information, we introduce knowledge distillation during the pruning step. Extensive experiments demonstrate the generalizability of our framework, encompassing both convolutional neural network (CNN) and transformer models, as well as image classification and segmentation tasks. In image classification, we achieve a speedup of up to 4.3× with an accuracy loss of 1.8%, and in semantic segmentation a speedup of up to 1.89× with a 5.1% loss in mIoU.
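To make the dependency issue concrete, the following is a minimal sketch, not the authors' implementation, of why coupled layers must be pruned jointly. It assumes two hypothetical layers, `conv_a` and `conv_b`, whose output channels are tied together (for example by a residual addition), so that removing channel c from one requires removing it from the other. The per-channel importance values are made-up placeholders standing in for the scores the pruning pipeline would compute, and the function names `group_scores` and `channels_to_keep` are illustrative only.

```python
from typing import Dict, List


def group_scores(scores: Dict[str, List[float]], group: List[str]) -> List[float]:
    """Sum per-channel importance over all layers in one coupled group."""
    n = len(scores[group[0]])
    # Coupled layers must expose the same number of channels.
    assert all(len(scores[name]) == n for name in group)
    return [sum(scores[name][c] for name in group) for c in range(n)]


def channels_to_keep(total: List[float], prune_ratio: float) -> List[int]:
    """Return the indices of the channels that survive pruning."""
    n_prune = int(len(total) * prune_ratio)
    # Rank channels from least to most important.
    order = sorted(range(len(total)), key=lambda c: total[c])
    pruned = set(order[:n_prune])
    return [c for c in range(len(total)) if c not in pruned]


# Hypothetical importance scores for two layers coupled by a residual add.
scores = {
    "conv_a": [0.9, 0.1, 0.5, 0.05],
    "conv_b": [0.8, 0.2, 0.1, 0.05],
}

total = group_scores(scores, ["conv_a", "conv_b"])
keep = channels_to_keep(total, prune_ratio=0.5)
print(keep)  # the same channel indices are kept in both layers
```

Scoring the group as a whole, rather than each layer separately, keeps the tensor shapes of coupled layers consistent after pruning. Summation is only one possible way to aggregate scores across a group and is an assumption of this sketch.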
