MDP: Multidimensional Vision Model Pruning with Latency Constraint
Xinglong Sun, Barath Lakshmanan, Maying Shen, Shiyi Lan, Jingde Chen, Jose M. Alvarez
TL;DR
This work tackles the challenge of aggressively pruning large vision models under realistic latency constraints. It identifies two limitations of existing methods: pruning restricted to channels, and linear latency models that generalize poorly, especially to transformers. It introduces Multi-Dimensional Pruning (MDP), which jointly optimizes across multiple granularities (channels, query/key, heads, embeddings, blocks) by solving a latency-aware MINLP guided by a latency lookup table (LUT) and a precise latency decomposition; block removal is handled via binary grouping to preserve information flow. Across CNN and transformer benchmarks, MDP delivers state-of-the-art speedups with minimal accuracy loss. For example, on ResNet50 it achieves a $28\%$ speed increase with a $+1.4$ Top-1 gain over HALP, and on DeiT-Base it delivers an additional $37\%$ acceleration with a $+0.7$ Top-1 gain over Isomorphic; 3D detection on nuScenes also benefits, with a $1.18\times$ speedup and improved mAP/NDS. The framework also covers practical aspects such as CPU adaptation and LUT reuse, and ablations show the essential roles of multi-granularity pruning (MGP) and multi-dimensional latency modeling (MDLM). Overall, MDP offers a robust, hardware-aware pruning paradigm that achieves substantial latency-accuracy gains across CNNs, transformers, and diverse vision tasks.
Abstract
Current structural pruning methods face two significant limitations: (i) they often limit pruning to finer-grained levels like channels, making aggressive parameter reduction challenging, and (ii) they focus heavily on parameter and FLOP reduction, with existing latency-aware methods frequently relying on simplistic, suboptimal linear models that fail to generalize well to transformers, where multiple interacting dimensions impact latency. In this paper, we address both limitations by introducing Multi-Dimensional Pruning (MDP), a novel paradigm that jointly optimizes across a variety of pruning granularities, including channels, query, key, heads, embeddings, and blocks. MDP employs an advanced latency modeling technique to accurately capture latency variations across all prunable dimensions, achieving an optimal balance between latency and accuracy. By reformulating pruning as a Mixed-Integer Nonlinear Program (MINLP), MDP efficiently identifies the optimal pruned structure across all prunable dimensions while respecting latency constraints. This versatile framework supports both CNNs and transformers. Extensive experiments demonstrate that MDP significantly outperforms previous methods, especially at high pruning ratios. On ImageNet, MDP achieves a 28% speed increase with a +1.4 Top-1 accuracy improvement over prior work like HALP for ResNet50 pruning. Against the latest transformer pruning method, Isomorphic, MDP delivers an additional 37% acceleration with a +0.7 Top-1 accuracy improvement.
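To make the idea of latency-constrained structure selection concrete, the following is a minimal toy sketch, not the paper's MINLP formulation or solver. It assumes two chained layers, a made-up staircase latency function used to fill a lookup table keyed by (input width, output width), and per-channel importance scores; it then picks the widths that maximize retained importance under a latency budget by exhaustive search. All names (`layer_latency_us`, `build_lut`, `select_widths`) and all numbers are hypothetical.

```python
import math
from itertools import product


def layer_latency_us(c_in, c_out):
    """Toy staircase latency in microseconds (an assumption, not measured).

    It depends jointly on input and output width, so it cannot be written
    as a sum of independent per-dimension costs.
    """
    return 50 * math.ceil(c_in / 8) * math.ceil(c_out / 8)


def build_lut(in_choices, out_choices):
    """Latency lookup table keyed by (input_channels, output_channels)."""
    return {(i, o): layer_latency_us(i, o) for i in in_choices for o in out_choices}


def top_k_importance(scores, k):
    """Importance retained when only the k highest-scoring channels are kept."""
    return sum(sorted(scores, reverse=True)[:k])


def select_widths(scores1, scores2, c_in, choices, budget_us):
    """Exhaustively search widths (c1, c2) for two chained layers.

    Returns (importance, c1, c2, latency_us) for the best feasible
    configuration, or None if nothing fits the budget.
    """
    lut1 = build_lut([c_in], choices)   # layer 1: c_in -> c1
    lut2 = build_lut(choices, choices)  # layer 2: c1 -> c2
    best = None
    for c1, c2 in product(choices, repeat=2):
        latency = lut1[(c_in, c1)] + lut2[(c1, c2)]
        if latency > budget_us:
            continue  # violates the latency constraint
        value = top_k_importance(scores1, c1) + top_k_importance(scores2, c2)
        if best is None or value > best[0]:
            best = (value, c1, c2, latency)
    return best


if __name__ == "__main__":
    uniform = [1.0] * 32
    print(select_widths(uniform, uniform, 16, [8, 16, 24, 32], 600))
```

Because the width of the first layer also sets the input width of the second, the choice for one layer changes the latency of the next. That coupling is what a per-layer linear latency model would miss. Exhaustive search only works at this toy scale; it stands in for a real solver here.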
