MDP: Multidimensional Vision Model Pruning with Latency Constraint

Xinglong Sun, Barath Lakshmanan, Maying Shen, Shiyi Lan, Jingde Chen, Jose M. Alvarez

TL;DR

This work tackles aggressive pruning of large vision models under realistic latency constraints, identifying two limitations of prior methods: channel-only pruning and linear latency models, the latter a particularly poor fit for transformers. It introduces Multi-Dimensional Pruning (MDP), which jointly optimizes across multiple granularities (channels, query/key, heads, embeddings, blocks) using a latency-aware MINLP guided by a latency lookup table (LUT) and a precise latency decomposition; block removal is handled via binary grouping to preserve information flow. Across CNN and Transformer benchmarks, MDP delivers state-of-the-art speedups with minimal accuracy loss: on ResNet50 it achieves a $28\%$ speed increase with a $+1.4$ Top-1 gain over HALP, and on DEIT-Base it adds a $37\%$ speedup over Isomorphic with a $+0.7$ Top-1 gain; 3D detection on NuScenes also benefits, with a $\times 1.18$ speedup and improved mAP/NDS. The work also covers practical aspects such as CPU adaptation and LUT reuse, and its ablations show the essential roles of multi-granularity pruning (MGP) and multi-dimensional latency modeling (MDLM). Overall, MDP offers a hardware-aware pruning paradigm that achieves substantial latency-accuracy gains across CNNs and transformers and across diverse vision tasks.

Abstract

Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning (MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities, including channels, query, key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37% acceleration with a +0.7 Top-1 accuracy improvement.
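
To make the MINLP reformulation concrete, the following is a schematic of what such a program can look like. It is an illustrative sketch, not the paper's exact equations. The symbols $\bm{\omega}$ (one-hot selection variables) and $\Psi$ (latency budget) follow the paper's Figure 2 caption; the importance vectors $\mathbf{I}_l$, the per-layer latency lookup matrices $\mathbf{C}_l$, the option counts $n_l$, and the assumption that a layer's latency depends only on its own choice and the preceding layer's are introduced here for illustration.

$$
\max_{\bm{\omega}} \ \sum_{l} \bm{\omega}_l^{\top} \mathbf{I}_l
\quad \text{s.t.} \quad
\sum_{l} \bm{\omega}_{l-1}^{\top} \mathbf{C}_l \, \bm{\omega}_l \le \Psi,
\qquad
\bm{\omega}_l \in \{0,1\}^{n_l}, \quad \mathbf{1}^{\top} \bm{\omega}_l = 1 .
$$

Under these assumptions the program is integer because each $\bm{\omega}_l$ selects exactly one configuration, and nonlinear because the latency term multiplies two decision vectors. A linear latency model would instead charge each layer a cost that depends on $\bm{\omega}_l$ alone.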

Paper Structure

This paper contains 19 sections, 6 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: MDP exhibits Pareto dominance with both CNNs and Transformers in tasks ranging from ImageNet classification to NuScenes 3D detection. Speedups are shown relative to the dense model. [Left] On ImageNet pruning ResNet50, we achieve a $28\%$ speed increase alongside a $+1.4$ improvement in Top-1 compared with prior art [shen2021halp]. [Middle] On ImageNet pruning DEIT-Base, compared with the very recent Isomorphic Pruning [fang2025isomorphic], our method further accelerates the baseline by an additional $37\%$ while yielding a $+0.7$ gain in Top-1. [Right] For 3D object detection, we obtain higher speed ($\times 1.18$) and mAP ($\mathbf{0.451}$ vs. $0.449$) compared to the dense baseline.
  • Figure 2: We begin by encoding prunable dimensions within the model with one-hot variables ($\bm{\omega}$), followed by establishing an importance objective and a latency constraint for each value of $\bm{\omega}$ using a prepared latency lookup table (LUT). Next, parameters are grouped by block, and an MINLP optimizes pruning across all dimensions under the latency budget $\Psi$. Finally, we extract the pruned subnetwork and finetune it.
  • Figure 3: 2D Object Detection on Pascal VOC. Pruning SSD512. Speedup versus mAP is plotted (top-right is better). Speedup is measured on an NVIDIA TITAN V and is relative to the dense FPS.
  • Figure 4: Ablation study results on ImageNet with ResNet50. We show results of each improvement acting individually. Top-right is better. MGP: Multi-Granularity Pruning (Ours); MDLM: Multi-Dimensional Latency Modeling (Ours); OCP: Only Channel Pruning; LLM: Linear Latency Modeling
  • Figure 5: Comparison of latency modeling between ours and prior art [shen2021halp, humble2022soft]. Example with CNNs.
  • ...and 3 more figures
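
The pipeline in the Figure 2 caption (one-hot choices per prunable unit, an importance objective, a LUT-based latency constraint, and a budget $\Psi$) can be sketched on a toy problem. The code below is a minimal illustration under strong assumptions, not the paper's solver: it treats each layer's latency as independent of its neighbours (the separable simplification that MDP's latency modeling is meant to go beyond), uses integer latency units, and swaps the MINLP solver for a multiple-choice knapsack dynamic program. The function name `select_configs` and all numbers are hypothetical. Option 0 of every layer has zero importance and zero latency, standing in for removing the whole block.

```python
from typing import List, Tuple


def select_configs(
    importance: List[List[float]],
    latency: List[List[int]],
    budget: int,
) -> Tuple[List[int], float]:
    """Pick exactly one option per layer (a one-hot choice) so that total
    importance is maximal and total latency stays within `budget`.

    importance[l][k] and latency[l][k] describe option k of layer l;
    latency is in integer units, as if read from a lookup table.
    """
    neg_inf = float("-inf")
    # best[b]: highest importance reachable with total latency <= b
    # over the layers processed so far (no layers yet: 0 everywhere).
    best = [0.0] * (budget + 1)
    picks: List[List[int]] = []  # picks[l][b]: option chosen for layer l

    for imp_l, lat_l in zip(importance, latency):
        new_best = [neg_inf] * (budget + 1)
        pick = [-1] * (budget + 1)
        for b in range(budget + 1):
            for k, (imp, lat) in enumerate(zip(imp_l, lat_l)):
                if lat <= b and best[b - lat] != neg_inf:
                    cand = best[b - lat] + imp
                    if cand > new_best[b]:
                        new_best[b] = cand
                        pick[b] = k
        best = new_best
        picks.append(pick)

    if best[budget] == neg_inf:
        raise ValueError("no configuration fits the latency budget")

    # Walk back through the layers to recover the chosen options.
    selected: List[int] = []
    b = budget
    for layer in reversed(range(len(picks))):
        k = picks[layer][b]
        selected.append(k)
        b -= latency[layer][k]
    selected.reverse()
    return selected, best[budget]


if __name__ == "__main__":
    # Toy data: three layers, each with options (removed, half width, full width).
    toy_importance = [[0.0, 3.0, 5.0], [0.0, 2.0, 6.0], [0.0, 4.0, 4.5]]
    toy_latency = [[0, 2, 4], [0, 3, 5], [0, 1, 3]]
    print(select_configs(toy_importance, toy_latency, budget=8))
    # prints ([1, 2, 1], 13.0)
```

In this toy run the second layer is kept at full width while the other two are halved, because that combination spends the 8-unit budget on the options with the best importance per unit of latency. Capturing coupling between adjacent layers, as the paper's multi-dimensional latency modeling aims to do, would make the latency term depend on pairs of choices and would no longer fit this simple per-layer recursion.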