FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference
Daya Khudia, Jianyu Huang, Protonu Basu, Summer Deng, Haixin Liu, Jongsoo Park, Mikhail Smelyanskiy
TL;DR
The paper presents FBGEMM, a CPU-focused library for high-performance quantized inference that fuses quantization steps with a highly optimized, shape-aware GEMM implementation. By using affine quantization, prepacked weights, and modular, fusion-friendly building blocks, FBGEMM achieves speedups of more than 2x on real-world Facebook workloads such as translation and content understanding. It demonstrates end-to-end pipelines built on INT8 GEMMs with INT16 accumulation, including outlier-aware quantization and dynamic kernel generation for diverse matrix shapes. The work suggests that tight integration of packing, quantization, and post-GEMM processing can markedly improve efficiency and may inform future HPC GEMM interfaces.
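To make the term "affine quantization" concrete, the following is a minimal sketch of the standard scheme in which a real value x is approximated as scale * (q - zero_point). The function names (`choose_qparams`, `quantize`, `dequantize`) and the unsigned 8-bit range are assumptions made for illustration; they are not FBGEMM's API.

```python
def choose_qparams(xmin, xmax, qmin=0, qmax=255):
    """Pick a scale and zero point mapping [xmin, xmax] onto [qmin, qmax]."""
    # Widen the range to include 0.0 so that zero is exactly representable.
    xmin = min(xmin, 0.0)
    xmax = max(xmax, 0.0)
    if xmax == xmin:
        return 1.0, qmin  # degenerate range: every value is 0.0
    scale = (xmax - xmin) / (qmax - qmin)
    zero_point = int(round(qmin - xmin / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(xs, scale, zero_point, qmin=0, qmax=255):
    """Map floats to integers, clamping to the representable range."""
    return [max(qmin, min(qmax, int(round(x / scale)) + zero_point)) for x in xs]


def dequantize(qs, scale, zero_point):
    """Recover approximate floats from quantized integers."""
    return [scale * (q - zero_point) for q in qs]


values = [-1.0, 0.0, 0.5, 2.0]
scale, zero_point = choose_qparams(min(values), max(values))
q_values = quantize(values, scale, zero_point)
recovered = dequantize(q_values, scale, zero_point)
```

In this example each recovered value differs from the original by at most about half a quantization step (`scale / 2`), and 0.0 round-trips exactly because the zero point is an integer.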
Abstract
Deep learning models typically use single-precision (FP32) floating-point data types to represent activations and weights, but a slew of recent research has shown that computations with reduced-precision data types (FP16, 16-bit integers, 8-bit integers, or even 4- or 2-bit integers) are enough to achieve the same accuracy as FP32 and are much more efficient. Therefore, we designed FBGEMM, a kernel library built from the ground up to perform high-performance quantized inference on current-generation CPUs. FBGEMM achieves efficiency by fusing common quantization operations with a high-performance GEMM implementation and by generating shape- and size-specific kernel code at runtime. The library has been deployed at Facebook, where it delivers greater than 2x performance gains with respect to our current production baseline.
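The idea of fusing quantization operations with the GEMM can be illustrated with a small sketch. With affine quantization, the real-valued product is a_scale * b_scale * sum_k (Aq[i][k] - a_zp) * (Bq[k][j] - b_zp), which expands into an integer matrix product plus correction terms built from row sums of A and column sums of B. The code below is an illustrative reference in plain Python, not FBGEMM's implementation; the function name `quantized_gemm_fused` and its parameters are hypothetical.

```python
def quantized_gemm_fused(a_q, b_q, a_zp, b_zp, a_scale, b_scale):
    """Integer GEMM with zero-point correction and scaling applied per output."""
    m, k, n = len(a_q), len(b_q), len(b_q[0])
    # Column sums depend only on the weights B, so they could be computed
    # once ahead of time (for example when weights are prepacked).
    b_col_sums = [sum(b_q[p][j] for p in range(k)) for j in range(n)]
    out = []
    for i in range(m):
        a_row_sum = sum(a_q[i])
        row = []
        for j in range(n):
            # Pure integer accumulation over the shared dimension.
            acc = sum(a_q[i][p] * b_q[p][j] for p in range(k))
            # Fused post-processing: remove the zero-point contributions,
            # then scale back to a real value.
            acc -= b_zp * a_row_sum + a_zp * b_col_sums[j]
            acc += k * a_zp * b_zp
            row.append(a_scale * b_scale * acc)
        out.append(row)
    return out


a_q = [[1, 2], [3, 4]]   # quantized activations
b_q = [[5, 6], [7, 8]]   # quantized weights
result = quantized_gemm_fused(a_q, b_q, a_zp=1, b_zp=2, a_scale=0.5, b_scale=0.25)
```

Applying the correction and scaling as each output element is produced avoids a separate pass over the integer result matrix, which is the kind of saving that fusion is meant to provide.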
