ASC: Adaptive Scale Feature Map Compression for Deep Neural Network

Yuan Yao; Tian-Sheuan Chang

TL;DR

This paper tackles the memory bandwidth and on-chip buffer bottlenecks caused by large feature maps in deep learning accelerators by proposing Adaptive Scale Feature Map Compression (ASC). ASC combines independent channel indexing, a cubical-like block shape, similarity-based reordering, a switchable endpoint mode, and adaptive interpolation with two scales (a revised linear scale and a log-linear scale). For 16-bit data it achieves $4\times$ constant-rate and up to $7.69\times$ variable-rate compression, with near-lossless performance on several models. The authors implement ASC in TSMC 28nm: the 8-bit version has an equivalent gate count of 6135, and a $32\times$ throughput increase costs only $7.65\times$ the hardware. The interpolation hardware stays simple through scale shifting, which lets interpolation points share hardware, and through threshold-based point selection. Across classification, segmentation, and super-resolution tasks, ASC shows substantial memory savings with controlled accuracy loss, and the hardware results indicate favorable throughput-area-power scaling compared to prior approaches, making it suitable for resource-limited DL accelerators.

Abstract

Deep-learning accelerators are increasingly in demand; however, their performance is constrained by the size of the feature map, leading to high bandwidth requirements and large buffer sizes. We propose an adaptive scale feature map compression technique leveraging the unique properties of the feature map. This technique adopts independent channel indexing given the weak channel correlation and utilizes a cubical-like block shape to benefit from strong local correlations. The method further optimizes compression using a switchable endpoint mode and adaptive scale interpolation to handle unimodal data distributions, both with and without outliers. This results in 4$\times$ and up to 7.69$\times$ compression rates for 16-bit data in constant and variable bitrates, respectively. Our hardware design minimizes area cost by adjusting interpolation scales, which facilitates hardware sharing among interpolation points. Additionally, we introduce a threshold concept for straightforward interpolation, avoiding the need for intricate hardware. The TSMC 28nm implementation has an equivalent gate count of 6135 for the 8-bit version. Furthermore, the hardware architecture scales effectively, with only a sublinear increase in area cost: a 32$\times$ throughput increase, enough to match the theoretical bandwidth of DDR5-6400, costs just 7.65$\times$ the hardware.
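
As a rough illustration of the endpoint-plus-interpolation idea described in the abstract, the sketch below stores two endpoints per block, replaces each value with a small index into a set of interpolated points, and picks whichever of two scales gives the lower error. Everything specific here is an assumption made for illustration: the function names, the 2-bit index width, the flat (non-cubical) block, and especially the power-of-two spacing of the log-style scale. It does not reproduce the paper's revised linear or log-linear scales, its switchable endpoint mode, or its bit layout.

```python
def make_scale(lo, hi, levels, mode):
    """Return `levels` reconstruction points between the endpoints lo and hi."""
    span = hi - lo
    if mode == "linear":
        # Evenly spaced points: suits smooth, image-like blocks.
        fracs = [i / (levels - 1) for i in range(levels)]
    else:
        # Hypothetical log-style spacing (0, ..., 1/4, 1/2, 1): points cluster
        # near lo, so hi can absorb an outlier without wasting resolution.
        fracs = [0.0] + [2.0 ** -(levels - 1 - i) for i in range(1, levels)]
    return [lo + span * f for f in fracs]


def encode_block(values, index_bits=2):
    """Encode a block as (lo, hi, mode, indices), choosing the better scale."""
    levels = 1 << index_bits
    lo, hi = min(values), max(values)
    best = None
    for mode in ("linear", "log"):
        points = make_scale(lo, hi, levels, mode)
        # Map every value to the index of its nearest reconstruction point.
        indices = [min(range(levels), key=lambda k: abs(v - points[k]))
                   for v in values]
        error = sum((v - points[i]) ** 2 for v, i in zip(values, indices))
        if best is None or error < best[0]:
            best = (error, mode, indices)
    _, mode, indices = best
    return lo, hi, mode, indices


def decode_block(lo, hi, mode, indices, index_bits=2):
    """Rebuild approximate values from the endpoints, scale mode, and indices."""
    points = make_scale(lo, hi, 1 << index_bits, mode)
    return [points[i] for i in indices]


if __name__ == "__main__":
    block = [0, 10, 20, 25, 30, 50, 45, 100]  # mostly small values, one outlier
    lo, hi, mode, indices = encode_block(block)
    print(mode, indices, decode_block(lo, hi, mode, indices))
```

In this toy layout, a block costs two endpoints, one mode flag, and one index per value, so the achievable ratio depends on the block size and index width chosen; the figures quoted in the abstract come from the paper's own format, not from this sketch.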

Paper Structure (34 sections, 1 equation, 17 figures, 16 tables)

Introduction
Proposed Method
Review of S3TC
Challenges and Proposed Solutions
Overview of ASC
Channel Indexing and Reordering
Independent Channel Indexing
Similarity-based Reordering
Cubical-Like Block Shape
Switchable Endpoint Mode
Adaptive Scale Interpolation
Variable bitrate version: ASC-VBR
Hardware Implementation
Challenges and Solutions
Proposed ASC Hardware
...and 19 more sections

Figures (17)

Figure 1: (a) S3TC encoding process, (b) S3TC decoding process
Figure 2: Proposed ASC-CBR processes: (a) encoding and (b) decoding
Figure 3: (a) Similarity matrix for an image, (b) Similarity matrix for a feature map
Figure 4: The heuristic method to match two channels
Figure 5: (a) Smooth and image-like block, (b) Disjointed block with outliers, (c) Revised linear scale, (d) Log-linear scale
...and 12 more figures
