
UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer

Ji Liu, Dehua Tang, Yuanxian Huang, Li Zhang, Xiaocheng Zeng, Dong Li, Mingjie Lu, Jinzhang Peng, Yu Wang, Fan Jiang, Lu Tian, Ashish Sirasao

TL;DR

The paper tackles pruning along the depth dimension for both CNNs and vision transformers, addressing issues with activation removal and normalization layers that hinder prior depth-pruning approaches. It introduces the Unified Progressive Depth Pruner, a four-stage framework (supernet training, subnet searching, progressive subnet training, and subnet merging via reparameterization) and a novel block-pruning strategy that converts complex blocks into simpler, mergeable forms while enabling BN fusion. The method extends to ViTs (e.g., DeiT) by modifying LN/GELU handling and residual connections, achieving state-of-the-art pruning performance on ConvNeXtV1 and competitive results on DeiT, with substantial speedups on AMD hardware. Overall, the approach improves hardware utilization and practical inference speed while preserving accuracy across diverse architectures, enabling more efficient deployment of both CNNs and vision transformers.
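The "subnet merging via reparameterization" and "BN fusion" steps mentioned above rest on a simple algebraic fact: once the activation between two linear operators is removed, the operators (and any BatchNorm between them) collapse into a single one. The following is a minimal sketch of that idea, not the authors' implementation. It uses small fully connected layers as a stand-in for 1x1 convolutions, and the helper names (`fold_bn`, `merge_linear`) and all numeric values are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def batchnorm(h, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm applied per output channel."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(h, gamma, beta, mean, var)]


def fold_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    """Fold BN(W x + b) into an equivalent single layer W' x + b'."""
    scale = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    W_f = [[sc * w for w in row] for sc, row in zip(scale, W)]
    b_f = [sc * (bi - m) + be for sc, bi, m, be in zip(scale, b, mean, beta)]
    return W_f, b_f


def merge_linear(W1, b1, W2, b2):
    """Merge W2 (W1 x + b1) + b2 into one layer; valid only with no activation in between."""
    W = matmul(W2, W1)
    b = [v + bi for v, bi in zip(matvec(W2, b1), b2)]
    return W, b


# Hypothetical two-layer block: linear -> BN -> (activation already removed) -> linear.
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
gamma, beta, mean, var = [1.5, 0.8], [0.0, 0.3], [0.2, -0.1], [1.0, 4.0]
W2, b2 = [[2.0, -1.0]], [0.5]
x = [0.7, -1.3]

# Reference: run the three operators one after another.
h = [v + bi for v, bi in zip(matvec(W1, x), b1)]
h = batchnorm(h, gamma, beta, mean, var)
y_ref = [v + bi for v, bi in zip(matvec(W2, h), b2)]

# Merged: fold BN into the first layer, then merge both layers into one.
W1_f, b1_f = fold_bn(W1, b1, gamma, beta, mean, var)
W, b = merge_linear(W1_f, b1_f, W2, b2)
y_merged = [v + bi for v, bi in zip(matvec(W, x), b)]
```

The merged layer produces the same output as the original three-operator chain while executing a single matrix product, which is why removing activations is a prerequisite for reducing depth.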

Abstract

Traditional channel-wise pruning methods, which reduce the number of network channels, struggle to prune efficient CNN models that contain depth-wise convolutional layers and certain efficient modules, such as the popular inverted residual blocks. Prior depth pruning methods, which reduce network depth, are unsuitable for some efficient models because of certain normalization layers. Moreover, finetuning a subnet after directly removing its activation layers corrupts the original model weights, which prevents the pruned model from achieving high performance. To address these issues, we propose a novel depth pruning method for efficient models, built on a novel block pruning strategy and a progressive training method for the subnet. Additionally, we extend our pruning method to vision transformer models. Experimental results demonstrate that our method consistently outperforms existing depth pruning methods across various pruning configurations. Applying our method to ConvNeXtV1, we obtain three pruned ConvNeXtV1 models that surpass most SOTA efficient models with comparable inference performance. Our method also achieves state-of-the-art pruning performance on the vision transformer model.
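To illustrate why a progressive schedule can avoid the weight corruption caused by removing an activation abruptly, the sketch below blends an activation with the identity using a factor that grows from 0 to 1 during training. This is an assumption about how such a scheme could look, not a description taken from the abstract: the blending formula, the linear schedule, the use of ReLU, and the names `blended_activation` and `linear_schedule` are all hypothetical.

```python
def relu(x):
    """Standard ReLU on a scalar."""
    return max(0.0, x)


def blended_activation(x, lam):
    """Interpolate between ReLU (lam = 0) and identity (lam = 1)."""
    return (1.0 - lam) * relu(x) + lam * x


def linear_schedule(step, total_steps):
    """Hypothetical schedule: lam rises linearly from 0 to 1 over training."""
    return min(1.0, max(0.0, step / total_steps))


# Trace how a negative input is treated as training progresses.
total_steps = 10
trace = [blended_activation(-2.0, linear_schedule(s, total_steps))
         for s in (0, 5, 10)]
```

At the start the layer behaves exactly like the original activation, so the pretrained weights remain valid; at the end it is the identity, so the surrounding linear operators can be merged.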

Paper Structure

This paper contains 12 sections, 5 equations, 2 figures, and 6 tables.

Figures (2)

  • Figure 1: Performance vs. speedup on ImageNet-1K. Our three pruned ConvNeXtV1 models outperform most SOTA efficient models, including RegNetY, RepVGG, VanillaNet, ConvNeXtV2, Swin-T, PVT, DeiT, EdgeViT, EfficientFormerV2, and FastViT.
  • Figure 2: Framework overview of our proposed depth pruner. Each pruned baseline block gradually evolves into a smaller merged block, which speeds up inference and saves memory. We experiment with four baselines: three CNN-based networks (ResNet34, MobileNetV2, and ConvNeXtV1) and one vision transformer network (DeiT-Tiny).