maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs
Andrew Lavin
TL;DR
maxDNN targets efficient forward convolution on Maxwell GPUs by combining a cuda-convnet2-style data layout with the Maxas SGEMM64 kernel. It precomputes input patch offsets and uses 64×64 tiling with zero-padding to optimize memory access, reaching about 96% computational efficiency. In comparative experiments against cuDNN v2 RC1, it shows substantial efficiency gains on Alexnet v2 and OverFeat, though first layers can incur a penalty when patch sizes are small. The work demonstrates that high-efficiency GPU convolution kernels are feasible on Maxwell and suggests extending the approach to backward propagation (BPROP).
Abstract
This paper describes maxDNN, a computationally efficient convolution kernel for deep learning with the NVIDIA Maxwell GPU. maxDNN reaches 96.3% computational efficiency on typical deep learning network architectures. The design combines ideas from cuda-convnet2 with the Maxas SGEMM assembly code. We address only the forward propagation (FPROP) operation of the network, but we believe that the same techniques will be effective for backward propagation (BPROP) as well.
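The TL;DR above mentions precomputed input patch offsets. As a rough illustration of that idea only, the following Python sketch builds an offset table once and reuses it for every output position. It is not the maxDNN kernel, which is written in Maxwell assembly. The function names `patch_offsets` and `conv_at`, the flat list storage, and the channels × height × width × batch ordering (batch index varying fastest, taken here as an approximation of the cuda-convnet2 layout) are all assumptions made for this example.

```python
def patch_offsets(channels, height, width, batch, kernel_h, kernel_w):
    """Return the flat-index offset of every filter tap relative to the
    top-left element of an input patch.

    Assumed layout (hypothetical for this sketch): channels x height x
    width x batch, stored row-major with the batch index varying fastest.
    """
    offsets = []
    for c in range(channels):
        for r in range(kernel_h):
            for s in range(kernel_w):
                offsets.append(((c * height + r) * width + s) * batch)
    return offsets


def conv_at(image, filt, offsets, y, x, n, width, batch):
    """Dot product of one filter with the patch whose top-left corner is
    (y, x) in image n, using the precomputed offset table instead of
    recomputing the index arithmetic for every tap."""
    base = (y * width + x) * batch + n
    return sum(w * image[base + off] for w, off in zip(filt, offsets))


# Tiny example: 2 channels, 4x4 images, batch of 2, 3x3 filter of ones.
C, H, W, N, R, S = 2, 4, 4, 2, 3, 3
image = list(range(C * H * W * N))   # element value equals its flat index
filt = [1] * (C * R * S)
table = patch_offsets(C, H, W, N, R, S)
result = conv_at(image, filt, table, 0, 0, 0, W, N)
```

Because the table depends only on the tensor and filter shapes, it is computed once, and the inner loop reduces to a base address plus a table lookup. A GPU kernel would presumably keep such a table in fast memory, but where and how maxDNN stores it is not stated in this section.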
