cuConv: A CUDA Implementation of Convolution for CNN Inference

Marc Jordà, Pedro Valero-Lara, Antonio J. Peña

TL;DR

This paper proposes a GPU-based implementation of the convolution operation for CNN inference that favors coalesced accesses, without requiring prior data transformations, and demonstrates notable performance improvements in a range of common CNN forward-propagation convolution configurations.

Abstract

Convolutions are the core operation of deep learning applications based on Convolutional Neural Networks (CNNs). Current GPU architectures are highly efficient for training and deploying deep CNNs, and hence are widely used in production for this purpose. State-of-the-art implementations, however, are inefficient for some commonly used network configurations. In this paper we propose a GPU-based implementation of the convolution operation for CNN inference that favors coalesced accesses, without requiring prior data transformations. Our experiments demonstrate that our proposal yields notable performance improvements in a range of common CNN forward-propagation convolution configurations, with speedups of up to 2.29x with respect to the best implementation of convolution in cuDNN, thus covering a relevant region that existing approaches handle inefficiently.

Paper Structure

This paper contains 14 sections, 7 figures, 5 tables.

Figures (7)

  • Figure 1: Convolution operations in a convolutional layer. The matrix produced by the convolution of Input 0 with Filter 0 is highlighted in light blue. The darker output element is the result of the dot product of Filter 0 with the highlighted subvolume of Input 0.
  • Figure 2: Schematic of our target GPU architecture.
  • Figure 3: Reuse of input data for two example rows of a filter (highlighted in blue and orange), for a convolution with a stride of 1. The highlighted input rows (two sets of 6x6 rows) are point-wise multiplied with the filter row of the same color during the convolution computation.
  • Figure 4: Stages of our implementation of convolution depicted for an arbitrary input and the first filter of the convolutional layer. Scalar products in the first stage generate partial results which are aggregated in the second stage to obtain the final output elements.
  • Figure 5: Speedup of our implementation of convolution w.r.t. the best performing cuDNN algorithm for each configuration. Configurations with $1\times1$ filters and batch size up to 64. Labels are formatted as [inputs X&Y size]-[number of filters]-[depth].
  • ...and 2 more figures
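
The captions of Figures 1 and 4 can be illustrated with a small CPU sketch. It is written in plain Python for readability and is not the authors' CUDA kernel. The function names (`conv2d_direct`, `stage1_partials`, `stage2_aggregate`), the `[channel][row][column]` layout, and the choice to split the work by filter row are all assumptions made for this example. The first function computes each output element as the dot product of a filter with an input subvolume (Figure 1). The other two split the same computation into a stage of per-row scalar products and a stage that sums the partial results (Figure 4).

```python
def out_size(n, k, stride):
    # Output extent along one axis for an unpadded convolution.
    return (n - k) // stride + 1


def conv2d_direct(inp, filt, stride=1):
    """Reference: one output element = dot product of the filter with
    the input subvolume under it. inp is [C][H][W], filt is [C][R][S]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    R, S = len(filt[0]), len(filt[0][0])
    h_out, w_out = out_size(H, R, stride), out_size(W, S, stride)
    out = [[0] * w_out for _ in range(h_out)]
    for y in range(h_out):
        for x in range(w_out):
            acc = 0
            for c in range(C):
                for r in range(R):
                    for s in range(S):
                        acc += inp[c][y * stride + r][x * stride + s] * filt[c][r][s]
            out[y][x] = acc
    return out


def stage1_partials(inp, filt, stride=1):
    """Stage 1 (assumed decomposition): for every (channel, filter row)
    pair, take the scalar product of that filter row with each window of
    the matching input row. Returns one partial output plane per pair."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    R, S = len(filt[0]), len(filt[0][0])
    h_out, w_out = out_size(H, R, stride), out_size(W, S, stride)
    partials = []
    for c in range(C):
        for r in range(R):
            plane = [
                [
                    sum(inp[c][y * stride + r][x * stride + s] * filt[c][r][s]
                        for s in range(S))
                    for x in range(w_out)
                ]
                for y in range(h_out)
            ]
            partials.append(plane)
    return partials


def stage2_aggregate(partials):
    """Stage 2: sum the partial planes element-wise to get the output."""
    h_out, w_out = len(partials[0]), len(partials[0][0])
    return [
        [sum(p[y][x] for p in partials) for x in range(w_out)]
        for y in range(h_out)
    ]


# Usage: two 3x3 input channels, one filter with two 2x2 channels.
inp = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],
]
filt = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
direct = conv2d_direct(inp, filt)
two_stage = stage2_aggregate(stage1_partials(inp, filt))
```

Both paths produce the same output matrix. In stage 1 the innermost reads walk along a single input row, which is the kind of contiguous access pattern a GPU implementation would want to map to neighboring threads. How the paper actually assigns work to threads is not described in this section.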