Realizing Unaligned Block-wise Pruning for DNN Acceleration on Mobile Devices

Hayun Lee, Dongkun Shin

TL;DR

This work tackles the practical barriers to unaligned block-wise pruning (UBP) for on-device DNN acceleration. It introduces BED (Block Expansion and Division), a fast block selection algorithm, and WROS, a weight-rotating, output-stationary dataflow, to enable near-optimal pruning during training and efficient inference on mobile CPUs. Empirical results on MobileNetV1 and ResNet50 show that UBP with BED can surpass aligned block-wise pruning (ABP) in accuracy at similar latency, and that WROS brings UBP kernel performance close to that of ABP on real devices. The findings suggest that UBP, when paired with the proposed techniques, is a viable and effective path to sparse neural network acceleration on resource-constrained mobile hardware.

Abstract

With the recent proliferation of on-device AI, there is an increasing need to run computationally intensive DNNs directly on mobile devices. However, the limited computing and memory resources of these devices necessitate effective pruning techniques. Block-wise pruning is promising because it incurs only a small accuracy drop for its speedup gains, but it requires block positions to be aligned with the block size, which prevents choosing the positions that minimize the model's accuracy drop. Unaligned block pruning (UBP) addresses this by allowing blocks to be selected at arbitrary positions, yet its practical use is limited by a time-consuming optimal block selection algorithm and the lack of efficient inference kernels. In this paper, we propose a pseudo-optimal yet fast block selection algorithm called Block Expansion and Division (BED), which can be integrated into an iterative model training process. Additionally, we introduce an efficient inference kernel implementation for mobile devices, enabling a UBP-based model to achieve latency similar to that of a DNN model compressed by aligned block pruning. We demonstrate the superiority of our techniques on a real mobile phone with MobileNet and ResNet models.
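
To make the aligned versus unaligned distinction concrete, the following is a minimal sketch on a single 1-D weight row. It is not the paper's BED algorithm or its optimal selection method. It assumes a block is scored by the sum of absolute weights, and the function names `aligned_block_mask` and `greedy_unaligned_block_mask`, the toy weights, and the block size are all hypothetical choices made for illustration.

```python
import numpy as np


def aligned_block_mask(w, block, keep):
    """Keep the `keep` best blocks whose start index is a multiple of `block`."""
    n = len(w)
    starts = np.arange(0, n - block + 1, block)
    # Assumed importance score: sum of absolute weights inside each block.
    scores = np.array([np.abs(w[s:s + block]).sum() for s in starts])
    order = np.argsort(-scores, kind="stable")
    mask = np.zeros(n, dtype=bool)
    for s in starts[order[:keep]]:
        mask[s:s + block] = True
    return mask


def greedy_unaligned_block_mask(w, block, keep):
    """Greedily keep up to `keep` non-overlapping blocks at arbitrary starts."""
    n = len(w)
    mag = np.abs(w)
    mask = np.zeros(n, dtype=bool)
    for _ in range(keep):
        best_score, best_start = -1.0, None
        for s in range(n - block + 1):
            if mask[s:s + block].any():
                continue  # would overlap a block that is already kept
            score = mag[s:s + block].sum()
            if score > best_score:
                best_score, best_start = score, s
        if best_start is None:
            break  # no room left for another block
        mask[best_start:best_start + block] = True
    return mask


# Toy weight row: the largest weights straddle an aligned block boundary.
w = np.array([0.1, 0.9, 0.8, 0.1, 0.1, 0.1, 0.7, 0.6])
aligned = aligned_block_mask(w, block=2, keep=2)
unaligned = greedy_unaligned_block_mask(w, block=2, keep=2)
print("aligned kept indices:  ", np.flatnonzero(aligned))
print("unaligned kept indices:", np.flatnonzero(unaligned))
print("retained |w| aligned:  ", np.abs(w)[aligned].sum())
print("retained |w| unaligned:", np.abs(w)[unaligned].sum())
```

On this toy row, the aligned variant keeps indices 0, 1, 6, 7 (retained magnitude about 2.3), while the unaligned variant keeps indices 1, 2, 6, 7 (about 3.0) at the same sparsity, because the pair 0.9 and 0.8 crosses an aligned boundary. Greedy selection of this kind is not guaranteed to be optimal in general, since an early pick can block two better neighbors.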

Paper Structure

This paper contains 22 sections, 5 equations, 9 figures, 1 table, 2 algorithms.

Figures (9)

  • Figure 1: Various pruning patterns.
  • Figure 2: Greedy block selection examples in UBP.
  • Figure 3: Overlapped output tiles problem in UBP kernel and naïve implementation.
  • Figure 4: Example of block expansion and division.
  • Figure 5: Sparse format and microkernel execution for weight rotating and output stationary dataflow.
  • ...and 4 more figures