Auto-NBA: Efficient and Effective Search Over the Joint Space of Networks, Bitwidths, and Accelerators

Yonggan Fu, Yongan Zhang, Yang Zhang, David Cox, Yingyan Celine Lin

TL;DR

This work tackles the problem of jointly optimizing networks, mixed-precision (bitwidths), and accelerators to maximize DNN performance. It introduces Auto-NBA, a bi-level optimization framework that combines two key innovations: heterogeneous sampling for scalable, unbiased network-precision search and a differentiable accelerator search engine that operates over a general chunk-based hardware template. Empirical results across CIFAR and ImageNet on FPGA and ASIC platforms show Auto-NBA achieves substantially faster search and superior accuracy–throughput/EDP trade-offs compared with state-of-the-art co-search, one-shot NAS, and hardware-aware NAS baselines. The proposed method provides a scalable, generic tool to accelerate DNN accelerator development and contributes practical insights into joint design of networks, precision, and hardware.

Abstract

While maximizing deep neural networks' (DNNs') acceleration efficiency requires a joint search/design of three different yet highly coupled aspects, including the networks, bitwidths, and accelerators, the challenges associated with such a joint search have not yet been fully understood and addressed. The key challenges include (1) the dilemma of whether to explode the memory consumption due to the huge joint space or achieve sub-optimal designs, (2) the discrete nature of the accelerator design space that is coupled yet different from that of the networks and bitwidths, and (3) the chicken-and-egg problem associated with network-accelerator co-search, i.e., co-search requires operation-wise hardware cost, which is lacking during search because the optimal accelerator, which depends on the whole network, is still unknown. To tackle these daunting challenges towards optimal and fast development of DNN accelerators, we propose a framework dubbed Auto-NBA to enable jointly searching for the Networks, Bitwidths, and Accelerators, by efficiently localizing the optimal design within the huge joint design space for each target dataset and acceleration specification. Our Auto-NBA integrates a heterogeneous sampling strategy to achieve unbiased search with constant memory consumption, and a novel joint-search pipeline equipped with a generic differentiable accelerator search engine. Extensive experiments and ablation studies validate that both the Auto-NBA-generated networks and accelerators consistently outperform state-of-the-art designs (including co-search/exploration techniques, hardware-aware NAS methods, and DNN accelerators) in terms of search time, task accuracy, and accelerator efficiency. Our code is available at: https://github.com/RICE-EIC/Auto-NBA.

Paper Structure

This paper contains 14 sections, 4 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Illustrating our Auto-NBA framework: The middle part shows (1) a high-level view of Auto-NBA and (2) the technical challenges that Auto-NBA tackles for enabling a scalable, generic joint-search for the networks, bitwidths, and accelerators.
  • Figure 2: (a) GPU memory consumption of soft Gumbel Softmax (GS) versus hard GS sampling, two ways of activating choices when co-searching the network and precision; and the probability evolution of each precision choice in the 4th block during search with: (b) hard GS sampling for updating both the weights $\omega$ and precision choices $\beta$, which results in the lowest precision, 4-bit, and (c) the proposed heterogeneous sampling for updating $\omega$ and $\beta$, which results in the highest precision, 12-bit (desired).
  • Figure 3: Accuracy vs. FPS trade-off of Auto-NBA against SOTA efficient DNN solutions on ImageNet.
  • Figure 4: Benchmark Auto-NBA w/ and w/o precision search (denoted as Auto-NBA-Mixed and Auto-NBA-16bit, respectively) with SOTA network/accelerator co-exploration methods [jiang2020hardware, abdelfattah2020best] on CIFAR-10/100/ImageNet.
  • Figure 5: Accuracy vs. FPS trade-off of Auto-NBA, Auto-NBA w/o heterogeneous sampling, and the sequential optimization baseline on CIFAR-100, under an FPGA DSP limit of 512.
  • ...and 1 more figure
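
The Figure 2 caption contrasts soft and hard Gumbel Softmax (GS) sampling. As a rough illustration of why the two differ in memory cost, the sketch below implements the standard Gumbel Softmax trick in plain Python. It is not the authors' code: the function names, the three logits in `beta`, and the helper `active_choices` are all hypothetical, and the sketch does not model gradients or the paper's heterogeneous sampling scheme.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, hard=False, rng=random):
    """Sample relaxed (soft) or one-hot (hard) weights over discrete choices."""
    # Perturb each logit with Gumbel(0, 1) noise: g = -log(-log(u)), u ~ U(0, 1).
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed logits.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    if not hard:
        return soft
    # Hard variant: keep only the arg-max choice as a one-hot vector.
    winner = soft.index(max(soft))
    return [1.0 if i == winner else 0.0 for i in range(len(soft))]


def active_choices(weights):
    """Count choices with a non-zero weight, i.e. those that must be evaluated."""
    return sum(1 for w in weights if w > 0.0)


# Hypothetical logits for three precision choices (assumed for illustration).
beta = [0.1, 0.3, 0.2]
rng = random.Random(0)
soft_weights = gumbel_softmax(beta, tau=1.0, hard=False, rng=rng)
hard_weights = gumbel_softmax(beta, tau=1.0, hard=True, rng=rng)
```

With soft sampling every choice receives a non-zero weight, so every candidate has to be evaluated and the cost grows with the number of choices. With hard sampling only one choice is active per sample. Practical implementations of the hard variant typically pair it with a straight-through gradient estimator, which this sketch omits.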