Scaling Up Quantization-Aware Neural Architecture Search for Efficient Deep Learning on the Edge

Yao Lu, Hiram Rayo Torres Rodriguez, Sebastian Vogel, Nick van de Waterlaat, Pavol Jancura

TL;DR

This work scales quantization-aware NAS to large-scale edge tasks by introducing QA-BWNAS, which injects quantization awareness into block-wise NAS and trains a student supernet through block-wise knowledge distillation from a teacher. It uses post-training quantization and lookup-table (LUT) based Pareto optimization to jointly search for an architecture and a few-bit mixed-precision (FB-MP) quantization policy under hardware constraints. On Cityscapes semantic segmentation, the resulting models reduce latency by up to 17.6% (INT8) and model size by up to 33% (FB-MP) relative to DeepLabV3 (INT8), while preserving or improving mIoU. The work also introduces a faster traversal method that reduces search time from hours to seconds.

Abstract

Neural Architecture Search (NAS) has become the de facto approach for designing accurate and efficient networks for edge devices. Since models are typically quantized for edge deployment, recent work has investigated quantization-aware NAS (QA-NAS) to search for highly accurate and efficient quantized models. However, existing QA-NAS approaches, particularly few-bit mixed-precision (FB-MP) methods, do not scale to larger tasks. Consequently, QA-NAS has mostly been limited to small-scale tasks and tiny networks. In this work, we present an approach to enable QA-NAS (INT8 and FB-MP) on large-scale tasks by leveraging the block-wise formulation introduced by block-wise NAS. We demonstrate strong results for the semantic segmentation task on the Cityscapes dataset, finding FB-MP models 33% smaller and INT8 models 17.6% faster than DeepLabV3 (INT8) without compromising task performance.

Paper Structure

This paper contains 14 sections, 1 equation, 3 figures, and 2 tables.

Figures (3)

  • Figure 1: Overview of our QA-BWNAS approach. (1) We train the blocks in the student supernet via feature-based knowledge distillation. (2) Subnets in each block are then quantized and evaluated in terms of their distillation loss and secondary hardware (HW) metrics (model size and latency) to populate lookup tables (LUTs) for searching. (3) We derive the block-wise Pareto-optimal subnets per bitwidth to remove the sub-optimal networks from the solution space. Finally, we jointly search for an architecture and quantization policy under a given HW constraint.
  • Figure 2: QA-BWNAS derives highly optimized solutions, reducing model size by up to 25% (INT8) and 33% (FB-MP) while retaining mIoU on the Cityscapes validation set.
  • Figure 3: QA-BWNAS reduces inference latency on an i.MX8M Plus by up to 17.6% while retaining mIoU on the Cityscapes validation set.
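
Steps (2) and (3) of Figure 1 and the final constrained search can be illustrated with a small sketch. This is not the authors' implementation. The LUT layout, the toy numbers, and the names `pareto_front`, `prune_block`, and `search` are all hypothetical. The sketch also assumes that the sum of per-block distillation losses serves as the proxy for task performance, and it enumerates the pruned candidates exhaustively instead of using the paper's faster traversal method, which is not described in this summary.

```python
from collections import defaultdict
from itertools import product

# Toy LUT with made-up values: one list of candidates per block.
# Each entry is a (subnet, bitwidth) pair with its distillation loss
# and model size in kB, as would be measured after quantization.
LUT = [
    [  # block 0
        {"name": "a0", "bits": 8, "loss": 0.10, "size_kb": 40},
        {"name": "a1", "bits": 8, "loss": 0.12, "size_kb": 45},
        {"name": "a0", "bits": 4, "loss": 0.20, "size_kb": 20},
        {"name": "a2", "bits": 4, "loss": 0.25, "size_kb": 25},
    ],
    [  # block 1
        {"name": "b0", "bits": 8, "loss": 0.05, "size_kb": 60},
        {"name": "b1", "bits": 8, "loss": 0.15, "size_kb": 50},
        {"name": "b0", "bits": 4, "loss": 0.18, "size_kb": 30},
        {"name": "b2", "bits": 4, "loss": 0.30, "size_kb": 35},
    ],
]


def pareto_front(cands):
    """Keep candidates not dominated in (loss, size); lower is better."""
    return [
        c for c in cands
        if not any(
            o["loss"] <= c["loss"] and o["size_kb"] <= c["size_kb"]
            and (o["loss"] < c["loss"] or o["size_kb"] < c["size_kb"])
            for o in cands
        )
    ]


def prune_block(cands):
    """Apply the Pareto filter separately for each bitwidth in one block."""
    by_bits = defaultdict(list)
    for c in cands:
        by_bits[c["bits"]].append(c)
    kept = []
    for bits in sorted(by_bits):
        kept.extend(pareto_front(by_bits[bits]))
    return kept


def search(lut, max_size_kb):
    """Pick one (subnet, bitwidth) per block minimizing the summed loss
    (assumed proxy) subject to a total model-size budget."""
    pruned = [prune_block(block) for block in lut]
    best = None
    for combo in product(*pruned):
        size = sum(c["size_kb"] for c in combo)
        loss = sum(c["loss"] for c in combo)
        if size <= max_size_kb and (best is None or loss < best["loss"]):
            best = {
                "loss": loss,
                "size_kb": size,
                "choice": [(c["name"], c["bits"]) for c in combo],
            }
    return best  # None if no combination fits the budget


if __name__ == "__main__":
    print(search(LUT, max_size_kb=80))
```

Pruning per bitwidth before the joint search shrinks the number of combinations to enumerate, and letting each block choose its own bitwidth is what makes the resulting policy mixed-precision. A latency budget could be handled the same way by adding a latency field to each LUT entry.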