High performance ultra-low-precision convolutions on mobile devices

Andrew Tulloch; Yangqing Jia

TL;DR

The paper targets mobile CNN acceleration under extreme quantization. It delivers an open-source runtime for ultra-low-precision convolutions on ARMv7 devices, built around a highly optimized binary inner-product microkernel, and combines SIMD-friendly bit packing, fused quantization, and cache-aware blocking to approach peak hardware throughput. Benchmarks against float32 and int8 baselines (NNPACK and gemmlowp) show 4×–20× speedups on Cortex-A7 and Cortex-A53 cores, which makes ultra-low-precision kernels practical for real-time mobile vision workloads. The paper also discusses training considerations (HWGQ, bit-decay) that help aggressively quantized models retain accuracy when deployed across a wide range of devices.
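To make the idea of a binary inner product concrete, here is a minimal Python sketch of the general XOR-plus-popcount technique. It is not the paper's ARMv7 microkernel. The function names `pack_signs` and `binary_dot` are hypothetical, and the encoding (bit 1 for +1, bit 0 for -1) is an assumption chosen for illustration.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into one integer, one bit per value.

    Assumed encoding (illustrative only): +1 -> bit 1, -1 -> bit 0.
    """
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("expected +1 or -1")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(packed_a, packed_b, n):
    """Dot product of two packed +/-1 vectors of length n.

    XOR marks the positions where the signs differ. Each matching position
    contributes +1 and each mismatch contributes -1, so the result is
    (n - mismatches) - mismatches = n - 2 * mismatches.
    """
    mismatches = bin(packed_a ^ packed_b).count("1")  # population count
    return n - 2 * mismatches


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = binary_dot(pack_signs(a), pack_signs(b), len(a))
```

A SIMD kernel would typically apply the same identity to many packed words at once with vector XOR and population-count instructions; the sketch only shows the arithmetic on a single packed word.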

Abstract

Many applications of mobile deep learning, especially real-time computer vision workloads, are constrained by computational power. This is particularly true for workloads running on older consumer phones, where a typical device might be powered by a single- or dual-core ARMv7 CPU. We provide an open-source implementation and a comprehensive analysis of (to our knowledge) the state-of-the-art ultra-low-precision (<4-bit precision) implementation of the core primitives required for modern deep learning workloads on ARMv7 devices, and demonstrate speedups of 4×–20× over our additional state-of-the-art float32 and int8 baselines.
