JAQ: Joint Efficient Architecture Design and Low-Bit Quantization with Hardware-Software Co-Exploration

Mingzi Wang; Yuan Meng; Chen Tang; Weixiang Zhang; Yijian Qin; Yang Yao; Yingxin Li; Tongtong Feng; Xin Wang; Xun Guan; Zhi Wang; Wenwu Zhu

TL;DR

JAQ tackles the challenge of deploying DNNs on resource-constrained devices by jointly optimizing neural architecture, ultra-low mixed-precision quantization, and accelerator design. The framework introduces Channel-wise Sparse Quantization (CSQ) to reduce memory usage during differentiable search, and BatchTile to rapidly explore compiler mappings, which together make joint co-exploration tractable. The objective couples classification loss with a hardware-cost term in a constrained optimization problem, solved in two stages: a search stage followed by retraining of the searched subnet. JAQ achieves approximately 7% higher ImageNet Top-1 accuracy than previous methods and reduces the hardware search time per iteration to about 0.15 seconds. These results demonstrate strong improvements over state-of-the-art co-design methods and offer a scalable path to edge deployment through software-hardware co-design.

Abstract

The co-design of neural network architectures, quantization precisions, and hardware accelerators offers a promising approach to achieving an optimal balance between performance and efficiency, particularly for model deployment on resource-constrained edge devices. In this work, we propose the JAQ Framework, which jointly optimizes these three critical dimensions. However, effectively automating the design process across the vast search space of those three dimensions poses significant challenges, especially when pursuing extremely low-bit quantization. Specifically, the primary challenges include: (1) Memory overhead on the software side: Low-precision quantization-aware training can lead to significant memory usage due to storing large intermediate features and latent weights for back-propagation, potentially causing memory exhaustion. (2) Time-consuming search on the hardware side: The discrete nature of hardware parameters and the complex interplay between compiler optimizations and individual operators make the accelerator search time-consuming. To address these issues, JAQ mitigates the memory overhead through a channel-wise sparse quantization (CSQ) scheme, selectively applying quantization to the most sensitive components of the model during optimization. Additionally, JAQ designs BatchTile, which employs a hardware generation network to encode all possible tiling modes, thereby speeding up the search for the optimal compiler mapping strategy. Extensive experiments demonstrate the effectiveness of JAQ, achieving approximately 7% higher Top-1 accuracy on ImageNet compared to previous methods and reducing the hardware search time per iteration to 0.15 seconds.
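
To make the idea of "selectively applying quantization to the most sensitive components" more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a uniform symmetric fake quantizer and a hypothetical sensitivity proxy (mean absolute weight per channel); the names `fake_quantize`, `channel_sensitivity`, `sparse_channel_quantize`, and `keep_ratio` are illustrative and do not come from the paper. It shows only the select-then-quantize pattern and does not reproduce the differentiable search or the memory savings.

```python
from typing import List, Tuple


def fake_quantize(values: List[float], bits: int) -> List[float]:
    """Uniform symmetric fake quantization: snap floats to a signed b-bit grid."""
    levels = 2 ** (bits - 1) - 1  # positive levels, e.g. 1 for 2 bits
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)
    scale = max_abs / levels
    return [round(v / scale) * scale for v in values]


def channel_sensitivity(channel: List[float]) -> float:
    """Hypothetical proxy (assumption): mean absolute weight of the channel."""
    return sum(abs(v) for v in channel) / len(channel)


def sparse_channel_quantize(
    weights: List[List[float]], bits: int, keep_ratio: float
) -> Tuple[List[List[float]], List[int]]:
    """Fake-quantize only the top `keep_ratio` fraction of channels.

    Channels are ranked by the proxy above; the rest are left untouched.
    """
    num_selected = max(1, int(len(weights) * keep_ratio))
    ranked = sorted(
        range(len(weights)),
        key=lambda i: channel_sensitivity(weights[i]),
        reverse=True,
    )
    selected = set(ranked[:num_selected])
    result = [
        fake_quantize(ch, bits) if i in selected else list(ch)
        for i, ch in enumerate(weights)
    ]
    return result, sorted(selected)


# Four output channels with three weights each (made-up toy values).
weights = [
    [0.9, -0.5, 0.1],
    [0.05, 0.02, -0.01],
    [0.4, -0.4, 0.1],
    [0.01, 0.0, 0.03],
]
quantized, selected = sparse_channel_quantize(weights, bits=2, keep_ratio=0.5)
print(selected)   # indices of the channels that were quantized
print(quantized)  # selected channels snapped to the grid, others unchanged
```

In this toy setup, half of the channels are quantized to 2 bits and the remainder pass through unchanged; a real selection criterion and quantizer would follow the paper's own definitions.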

Paper Structure (31 sections, 9 equations, 5 figures, 7 tables)

Introduction
Related Work
Quantization and Neural Architecture Search
DNN Accelerators
Hardware-software Co-design
JAQ Framework
Preliminary
Differentiable Neural Architecture Search.
Quantization.
Problem Formulation
Channel-wise Sparse Quantization (CSQ)
Memory Cost Bottleneck.
CSQ.
Accelerator Architecture Search
Accelerator Search Space.
...and 16 more sections

Figures (5)

Figure 1: JAQ framework. The left part represents the optimization of network structure and bitwidth allocation, addressing the memory cost bottleneck through channel-wise sparse quantization. The right part depicts accelerator architecture search, including hardware parameters and compiler mapping strategy. Hardware metrics indicate accelerator performance (Energy, Latency and Area).
Figure 2: (a) depicts the GPU memory usage with an increasing number of bitwidth choices on CIFAR-100 and ImageNet (batch size 128). (b) presents the GPU memory usage during the quantization stage for weights and activations on CIFAR-100 and ImageNet (batch size 256). (c) contrasts GPU memory usage on CIFAR-100 between our work and the non-optimized baseline (batch size 256).
Figure 3: The overall accelerator search framework of JAQ. The right part represents the executing workload of a CNN operator after compiler mapping, which can be segmented into tiles across five dimensions. The left part displays an optimization pipeline including subnet encoder, accelerator parameters search, and the BatchTile method. The bottom section elaborates on the meanings of each field within the three distinct vectors.
Figure 4: Visualization of searched network, bitwidths and accelerator on CIFAR-100.
Figure 5: (a) and (c) demonstrate the problems of parameter coupling and misguided search in the algorithm of previous work (fu2021auto) under the unconstrained condition. (b) and (d) illustrate that our channel-wise sparse quantization algorithm does not suffer from parameter coupling or misguided search under the unconstrained condition.
