ASC: Adaptive Scale Feature Map Compression for Deep Neural Network

Yuan Yao, Tian-Sheuan Chang

TL;DR

This paper tackles the memory-bandwidth and on-chip buffer bottlenecks that large feature maps cause in deep-learning accelerators by proposing Adaptive Scale Feature Map Compression (ASC). ASC combines independent channel indexing, a cubical-like block shape, similarity-based reordering, a switchable endpoint mode, and adaptive interpolation with two scales (a revised linear scale and a log-linear scale). For 16-bit data it reaches $4\times$ compression at a constant bitrate and up to $7.69\times$ at a variable bitrate, with near-lossless performance on several models. The authors implement ASC in a hardware-friendly TSMC 28nm design: the 8-bit version has an equivalent gate count of 6135, and a $32\times$ throughput increase costs only $7.65\times$ the hardware. The design keeps interpolation cheap through scale-shifting and threshold-based point selection. Across classification, segmentation, and super-resolution tasks, ASC shows substantial memory savings with controlled accuracy loss, and the hardware results indicate favorable throughput-area-power scaling compared to prior approaches, making it suitable for resource-limited DL accelerators.

Abstract

Deep-learning accelerators are increasingly in demand; however, their performance is constrained by the size of the feature map, leading to high bandwidth requirements and large buffer sizes. We propose an adaptive scale feature map compression technique leveraging the unique properties of the feature map. This technique adopts independent channel indexing given the weak channel correlation and utilizes a cubical-like block shape to benefit from strong local correlations. The method further optimizes compression using a switchable endpoint mode and adaptive scale interpolation to handle unimodal data distributions, both with and without outliers. This results in 4$\times$ and up to 7.69$\times$ compression rates for 16-bit data in constant and variable bitrates, respectively. Our hardware design minimizes area cost by adjusting interpolation scales, which facilitates hardware sharing among interpolation points. Additionally, we introduce a threshold concept for straightforward interpolation, preventing the need for intricate hardware. The TSMC 28nm implementation showcases an equivalent gate count of 6135 for the 8-bit version. Furthermore, the hardware architecture scales effectively, with only a sublinear increase in area cost. Achieving a 32$\times$ throughput increase meets the theoretical bandwidth of DDR5-6400 at just 7.65$\times$ the hardware cost.
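To make the idea of endpoint-based, adaptive-scale interpolation concrete, the following is a minimal Python sketch of a generic S3TC-style block quantizer that tries two interpolation scales and keeps the one with the lower error. It is an illustration under stated assumptions, not the authors' encoder: the function names (`make_scale`, `encode_block`, `decode_block`), the four-point palette, the power-of-two "log" weights, and the use of the block minimum and maximum as endpoints are all hypothetical choices made here. The paper's revised linear scale, log-linear scale, switchable endpoint mode, and bit layout are not specified in this summary and are not reproduced.

```python
def make_scale(kind, n_points=4):
    """Return n_points interpolation weights in [0, 1] (hypothetical scales)."""
    if kind == "linear":
        # Evenly spaced weights: suits smooth, image-like blocks.
        return [i / (n_points - 1) for i in range(n_points)]
    if kind == "log":
        # Weights 0, ..., 1/4, 1/2, 1: dense near the low endpoint, so one
        # large outlier does not waste most of the palette. Stand-in only.
        return [0.0] + [2.0 ** -k for k in range(n_points - 2, -1, -1)]
    raise ValueError(f"unknown scale: {kind}")


def encode_block(values, n_points=4):
    """Store two endpoints, a scale choice, and one small index per value."""
    lo, hi = min(values), max(values)  # assumption: endpoints are min and max
    best = None
    for kind in ("linear", "log"):
        palette = [lo + w * (hi - lo) for w in make_scale(kind, n_points)]
        # Map each value to the index of its nearest palette entry.
        indices = [
            min(range(n_points), key=lambda i: abs(palette[i] - v))
            for v in values
        ]
        error = sum((palette[i] - v) ** 2 for i, v in zip(indices, values))
        if best is None or error < best[0]:
            best = (error, kind, indices)
    _, kind, indices = best
    return lo, hi, kind, indices


def decode_block(lo, hi, kind, indices, n_points=4):
    """Rebuild the palette from the endpoints and look up each index."""
    palette = [lo + w * (hi - lo) for w in make_scale(kind, n_points)]
    return [palette[i] for i in indices]
```

For example, `encode_block([0, 16, 32, 64])` selects the `"log"` scale in this sketch, because the values cluster near the low endpoint with one larger value, and `decode_block` then reconstructs the block exactly. An evenly spread block such as `[0, 1, 2, 3]` selects `"linear"` instead.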

Paper Structure (34 sections, 1 equation, 17 figures, 16 tables)


Figures (17)

  • Figure 1: (a) S3TC encoding process, (b) S3TC decoding process
  • Figure 2: Proposed ASC-CBR processes: (a) encoding and (b) decoding
  • Figure 3: (a) Similarity matrix for an image, (b) Similarity matrix for a feature map
  • Figure 4: The heuristic method to match two channels
  • Figure 5: (a) Smooth and image-like block, (b) Disjointed block with outliers, (c) Revised linear scale, (d) Log-linear scale
  • ...and 12 more figures