maxDNN: An Efficient Convolution Kernel for Deep Learning with Maxwell GPUs

Andrew Lavin

TL;DR

maxDNN targets efficient forward convolution on Maxwell GPUs by combining a cuda-convnet2-style data layout with the Maxas SGEMM64 kernel. It precomputes input patch offsets and uses 64×64 tiling with zero-padding to optimize memory access, reaching 96.3% computational efficiency on typical network architectures. Comparative experiments against cuDNN v2 RC1 show substantial efficiency gains on Alexnet v2 and OverFeat, though first layers can incur penalties when patch sizes are small. The work demonstrates that high-efficiency convolution kernels are feasible on Maxwell GPUs and suggests extending the approach to backward propagation (BPROP).

Abstract

This paper describes maxDNN, a computationally efficient convolution kernel for deep learning with the NVIDIA Maxwell GPU. maxDNN reaches 96.3% computational efficiency on typical deep learning network architectures. The design combines ideas from cuda-convnet2 with the Maxas SGEMM assembly code. We address only the forward propagation (FPROP) operation of the network, but we believe that the same techniques used here will be effective for backward propagation (BPROP) as well.
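
The precomputed patch offsets mentioned in the summary can be illustrated with a small sketch. This is not the maxDNN kernel itself, which is written in Maxwell assembly. It is a plain Python illustration that assumes a channels × height × width × batch layout with the batch index innermost, and the names `patch_offsets` and `convolve_pixel` are hypothetical. The idea shown is that the offset of every filter tap relative to the patch origin depends only on the layer shape, so it can be computed once and reused for every output pixel.

```python
def patch_offsets(channels, height, width, batch, filt_h, filt_w):
    """Element offsets of each filter tap relative to the patch origin.

    Assumes (for illustration only) a channels x height x width x batch
    layout with the batch index varying fastest.
    """
    offsets = []
    for c in range(channels):
        for r in range(filt_h):
            for s in range(filt_w):
                # Linear index of (c, r, s) for image 0 in the batch.
                offsets.append(((c * height + r) * width + s) * batch)
    return offsets


def convolve_pixel(image, offsets, weights, origin, n):
    """Dot product of one filter with one input patch for batch image n.

    `origin` is the linear index of the patch's top-left element in
    channel 0 for image 0; the table supplies every other address.
    """
    return sum(w * image[origin + off + n] for w, off in zip(weights, offsets))
```

With the table in hand, the inner loop needs only one addition per load instead of recomputing channel, row, and column indices, which is the kind of address arithmetic that precomputation is meant to remove from the kernel's inner loop.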

Paper Structure

This paper contains 8 sections, 4 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: FPROP convolution with minibatch size 128 for Alexnet v.2.
  • Figure 2: FPROP convolution with minibatch size 128 for Overfeat. maxDNN efficiency suffers when the number of filters is not a multiple of 64, but is otherwise consistently high. maxDNN variants with other shared memory blocking sizes would likely address this shortcoming.
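
The sensitivity to filter counts noted in the Figure 2 caption follows from simple tile arithmetic. The sketch below assumes that the filter dimension is padded up to a whole number of 64-wide tiles and that padded lanes do no useful work. The function name `tile_utilization` and the example filter counts are illustrative and are not taken from the paper. The result is only an upper bound on efficiency from padding alone, not a prediction of measured performance.

```python
import math


def tile_utilization(num_filters, tile=64):
    """Fraction of tile lanes doing useful work when the filter
    dimension is padded up to a multiple of `tile`."""
    tiles = math.ceil(num_filters / tile)
    return num_filters / (tiles * tile)


if __name__ == "__main__":
    for k in (64, 96, 128, 256):
        print(k, tile_utilization(k))
```

Under these assumptions, a layer with 96 filters occupies two 64-wide tiles and uses 96 of 128 lanes, so at most 75% of the arithmetic is useful, while any multiple of 64 gives full utilization. A smaller blocking size, as the caption suggests, would shrink the padded remainder.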