The Hidden Power of Pure 16-bit Floating-Point Neural Networks
Juyoung Yun, Byungkon Kang, Zhoulai Fu
TL;DR
The paper asks whether pure 16-bit floating-point neural networks can match the accuracy of 32-bit models in classification tasks. It develops an error-tolerance framework that defines the output discrepancy $\delta(M_{32},M_{16},x)$ and the prediction margin $\Gamma(M,x)$, and proves a sufficient condition, $\Gamma(M_{32},x) \geq 2\,\delta(M_{32},M_{16},x)$, under which $\text{pred}(M_{32},x)=\text{pred}(M_{16},x)$. The authors validate the theory with MNIST and CIFAR-10 experiments across DNN and CNN architectures, showing that pure 16-bit networks achieve accuracy comparable to or better than 32-bit models and often surpass mixed-precision baselines, while substantially reducing training time and model size. They also discuss practical limitations, such as the need to tune the optimizer's epsilon, difficulties with batch normalization in pure 16-bit, and batch-size effects, and outline avenues for extending the work to more architectures and tasks. Overall, the findings challenge the notion that pure 16-bit training is impractical, pointing to significant efficiency gains with minimal loss in performance in typical classification scenarios.
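The sufficient condition can be checked numerically on a single input. The following is a minimal sketch, not the authors' code: it assumes $\delta$ is the largest absolute difference between the two models' output vectors and $\Gamma$ is the gap between the largest and second-largest output, since this summary does not spell out either definition. The function names and the example vectors are hypothetical.

```python
import numpy as np

def delta(out32, out16):
    # Output discrepancy. ASSUMPTION: max-norm of the difference
    # between the 32-bit and 16-bit output vectors.
    diff = out32.astype(np.float64) - out16.astype(np.float64)
    return float(np.max(np.abs(diff)))

def gamma(out):
    # Prediction margin. ASSUMPTION: largest output minus
    # second-largest output.
    top2 = np.sort(out.astype(np.float64))[-2:]  # ascending order
    return float(top2[1] - top2[0])

def same_prediction_guaranteed(out32, out16):
    # Sufficient (not necessary) condition: Gamma(M32, x) >= 2 * delta.
    return gamma(out32) >= 2.0 * delta(out32, out16)

# Hypothetical 32-bit output; the 16-bit output is simulated here by
# rounding the same vector to half precision.
p32 = np.array([0.70, 0.20, 0.10], dtype=np.float32)
p16 = p32.astype(np.float16)

# Hypothetical near-tie where the 16-bit model flips the prediction.
q32 = np.array([0.51, 0.49], dtype=np.float32)
q16 = np.array([0.49, 0.51], dtype=np.float16)

print(same_prediction_guaranteed(p32, p16), np.argmax(p32) == np.argmax(p16))
print(same_prediction_guaranteed(q32, q16), np.argmax(q32) == np.argmax(q16))
```

In the first case the margin is far larger than twice the discrepancy, so the two predictions must agree. In the second case the margin is too small relative to the discrepancy, the condition fails, and the predictions do differ. Because the condition is only sufficient, a failed check does not by itself imply a changed prediction.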
Abstract
Lowering the precision of neural networks from the prevalent 32-bit precision has long been considered harmful to performance, despite the gain in space and time. Many works propose various techniques to implement half-precision neural networks, but none study pure 16-bit settings. This paper investigates the unexpected performance gain of pure 16-bit neural networks over 32-bit networks in classification tasks. We present extensive experimental results that compare the performance of various 16-bit neural networks favorably to that of 32-bit models. In addition, we provide a theoretical analysis of the efficiency of 16-bit models, supported by empirical evidence. Finally, we discuss situations in which low-precision training is indeed detrimental.
