High performance ultra-low-precision convolutions on mobile devices
Andrew Tulloch, Yangqing Jia
TL;DR
The paper targets mobile CNN acceleration under extreme quantization by delivering an open-source, ARMv7-friendly runtime for ultra-low-precision convolutions, built around a highly optimized binary inner-product microkernel. It introduces bit-packing and SIMD-friendly techniques, fused quantization, and cache-aware blocking to approach peak hardware throughput. Benchmarks against float32 and int8 baselines (NNPACK and GEMMLOWP, respectively) show 4×–20× speedups on Cortex-A7/A53, positioning ultra-low-precision kernels as practical for real-time mobile vision workloads. The work also discusses training considerations (HWGQ, bit-decay) that help aggressively quantized models retain high accuracy when deployed on a wide range of devices.
Abstract
Many applications of mobile deep learning, especially real-time computer vision workloads, are constrained by computational power. This is particularly true for workloads running on older consumer phones, where a typical device might be powered by a single- or dual-core ARMv7 CPU. We provide an open-source implementation and a comprehensive analysis of what is, to our knowledge, the state-of-the-art ultra-low-precision (<4-bit) implementation of the core primitives required for modern deep learning workloads on ARMv7 devices, and demonstrate speedups of 4×–20× over our additional state-of-the-art float32 and int8 baselines.
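
To make the idea of a binary inner product concrete, the following is a minimal scalar sketch in Python, not the paper's ARMv7 microkernel. It assumes the common encoding in which each value is +1 or -1 and is stored as a single bit (1 for +1, 0 for -1); under that assumption, the dot product of two length-n vectors equals n minus twice the number of bit positions where they differ. The names pack_signs and binary_dot are hypothetical and chosen only for illustration.

```python
def pack_signs(values):
    """Pack a sequence of +1/-1 values into one integer.

    Bit i of the result is 1 when values[i] == +1 and 0 when it is -1.
    (Illustrative encoding; an assumption, not taken from the paper.)
    """
    word = 0
    for i, v in enumerate(values):
        if v not in (1, -1):
            raise ValueError("expected only +1 or -1 entries")
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Matching bits contribute +1 and differing bits contribute -1, so the
    result is (n - mismatches) - mismatches = n - 2 * mismatches.
    """
    mismatches = bin(a_bits ^ b_bits).count("1")  # popcount of the XOR
    return n - 2 * mismatches


if __name__ == "__main__":
    a = [1, -1, 1, 1, -1, -1, 1, -1]
    b = [1, 1, -1, 1, -1, 1, 1, -1]
    packed = binary_dot(pack_signs(a), pack_signs(b), len(a))
    reference = sum(x * y for x, y in zip(a, b))
    print(packed, reference)  # the two values should agree
```

The packed form replaces n multiply-accumulates with one XOR and one population count per machine word, which is the property that SIMD implementations of such kernels exploit.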
