Optimizing Winograd Convolution on ARMv8 processors

Haoyuan Gui, Xiaoyu Zhang, Chong Zhang, Zitong Su, Huiyuan Li

TL;DR

The paper addresses the inefficiencies of Winograd convolution on ARMv8 with a fused framework that unifies input transformation, GEMM, and output transformation in a cache-aware, assembly-optimized design. It adds a z-shaped data layout, a ping-pong GEMM micro-kernel, and a multi-dimensional parallel strategy that adapts to layer scales. Reported speedups reach up to 10.57x over ACL on the Kunpeng 920, 7.80x on the AWS Graviton2, and 9.28x on the Phytium 2000+, with further gains over NCNN, NNPACK, and FastConv, strong scalability, and accuracy preserved within practical bounds. The work advances practical Winograd acceleration on mobile and server ARM CPUs and points to extensions to 3-D Winograd and other architectures.

Abstract

As Convolutional Neural Networks (CNNs) gain prominence in deep learning, algorithms like Winograd Convolution have been introduced to enhance computational efficiency. However, existing implementations often face challenges such as high transformation overhead, suboptimal computation efficiency, and reduced parallel performance in some layers. We propose a fused Winograd Convolution algorithm optimized for ARMv8 CPUs, integrating input transformation, filter transformation, computation, and output transformation into a single pipeline. By maintaining consecutive memory access and using a custom z-shaped data layout, our approach fully utilizes an optimized GEMM micro-kernel with a ping-pong technique. Additionally, we introduce a multi-dimensional parallel strategy that adapts to convolutional layer scales. To maximize performance, we manually optimize each kernel in AArch64 assembly and carefully tune blocking parameters. Experimental results show speedups of up to 4.74x, 4.10x, 4.72x, and 10.57x over NCNN, NNPACK, FastConv, and ACL on the Kunpeng 920 platform using multiple threads, with respective gains of 3.85x, 2.81x, 4.20x, and 7.80x on the AWS Graviton2, and 3.32x, 3.68x, 8.00x, and 9.28x on the Phytium 2000+.
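
For orientation, the stages named in the abstract can be illustrated on a single tile. The sketch below is not the paper's fused AArch64 implementation. It is a minimal pure-Python rendering of the textbook $F(2\times2,3\times3)$ formula $Y = A^T[(G g G^T) \odot (B^T d B)]A$ for one channel, using the standard transform matrices. The helper names (`matmul`, `winograd_tile`, `direct_tile`) are illustrative assumptions and do not come from the paper.

```python
# Winograd F(2x2, 3x3) on one 4x4 input tile, compared with direct convolution.
# Standard F(2x2, 3x3) transform matrices; all helper names are illustrative.

BT = [[1, 0, -1, 0],
      [0, 1, 1, 0],
      [0, -1, 1, 0],
      [0, 1, 0, -1]]
G = [[1.0, 0.0, 0.0],
     [0.5, 0.5, 0.5],
     [0.5, -0.5, 0.5],
     [0.0, 0.0, 1.0]]
AT = [[1, 1, 1, 0],
      [0, 1, -1, -1]]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    return [list(row) for row in zip(*m)]


def winograd_tile(d, g):
    """d: 4x4 input tile, g: 3x3 filter. Returns the 2x2 output tile."""
    V = matmul(matmul(BT, d), transpose(BT))   # input transform  B^T d B
    U = matmul(matmul(G, g), transpose(G))     # filter transform G g G^T
    # Element-wise product: 16 multiplications instead of 36 for this tile.
    M = [[U[i][j] * V[i][j] for j in range(4)] for i in range(4)]
    return matmul(matmul(AT, M), transpose(AT))  # output transform A^T M A


def direct_tile(d, g):
    """Reference: direct 3x3 correlation producing the same 2x2 outputs."""
    return [[sum(d[i + p][j + q] * g[p][q] for p in range(3) for q in range(3))
             for j in range(2)] for i in range(2)]
```

With many input and output channels, the element-wise product at each of the 16 tile positions is summed over channels, which is why the middle stage can be expressed as matrix multiplication and handed to a GEMM micro-kernel.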

Paper Structure

This paper contains 19 sections, 13 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of the procedure for processing one tile of input using our method, which includes the three stages of Winograd convolution: Input/Filter Transformation, Matrix Multiplication, and Output Transformation. The yellow, blue, and green rectangles represent the data of the input, filter, and output, respectively. The highlighted sections of each color indicate the data loaded into the same vector register, which will be processed simultaneously. After transforming the input and filter, the data is packed into a layout that is friendly to GEMM operations, ensuring consecutive memory access during computation. The results of the GEMM are then transformed back to the spatial domain and stored in the final output.
  • Figure 2: This figure illustrates the transformation process of $F(2\times2,3\times3)$. Each vector register contains $\theta$ elements. For simplicity, we present a front view of this process. The figure demonstrates the register arrangement of our method, with numbers denoting the index of the vector registers. In the initial iteration, the entire tile is loaded into registers $v0$ to $v15$, while $v16$ to $v31$ are used to store the results. When left-multiplying with $B^T$, registers $v0,v1,v8$ and $v9$ are freed to store the temporary results. After processing the first tile, data in registers $v2,v3,v6,v7,v10,v11,v14$ and $v15$ can be reused, requiring only the non-overlapping data of the second tile to be loaded into $v0, v1, v4, v5, v8, v9, v12$ and $v13$. For the next tile, the process is reversed: reusing $v0, v1, v4, v5, v8, v9, v12$ and $v13$, and loading new data into $v2,v3,v6,v7,v10,v11,v14$ and $v15$. This alternating pattern continues for subsequent iterations, significantly reducing the number of elements that need to be loaded.
  • Figure 3: This figure depicts the data layout used in our implementation for the transformed input and filter. The core concept of our method is to initially divide the original matrix into blocks that fit within the cache capacity. These blocks are then processed using multiple micro-kernels for GEMM operations. Each micro-kernel handles the matrix multiplication involving $\alpha$ rows and $\eta$ columns. By organizing the data layout in this manner, we ensure continuous memory access, which significantly improves performance. In this figure, we primarily highlight the data arrangement within each block and the relationships between blocks.
  • Figure 4: This figure illustrates the arrangement of vector registers for the micro-kernel. The notation $\#num$ denotes the stage number of the pipeline in the "ping-pong" technique, and each number represents the index of the vector register. Both configurations utilize the entire set of 32 SIMD registers. The yellow, blue, and green rectangles represent the data of the input, filter, and result, respectively.
  • Figure 5: Step-wise comparison of the convolution layers’ runtime against NCNN, NNPACK, FastConv and ACL with the same $F(m,r)$ on the Kunpeng 920. Each point on the x-axis represents a different layer, with VN and FN being abbreviations for VggNet and FusionNet, respectively. The y-axis denotes the runtime in milliseconds (ms). The left figure shows $F(2\times2,3\times3)$, the middle figure shows $F(4\times4,3\times3)$ and the right figure shows $F(6\times6,3\times3)$. Each number above the bars represents the speedup our approach achieves compared to the corresponding library.
  • ...and 4 more figures
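
The alternating register reuse described in the Figure 2 caption can be sketched as a small simulation. This is an illustration under stated assumptions rather than the authors' kernel: it assumes register index `4 * row + column` for the $4\times4$ tile, tiles that advance horizontally with stride 2, and one scalar per "register" where the paper's registers hold $\theta$ elements. The function name `sweep_tiles` is hypothetical.

```python
def sweep_tiles(strip, num_tiles):
    """Slide 4x4 tiles with stride 2 across a 4-row strip, reusing overlap.

    strip: 4 rows, each with at least 2 * num_tiles + 2 columns.
    Returns (tiles, written), where written[t] lists the register indices
    loaded for tile t. Assumed mapping: input column x of row r lives in
    register 4 * r + (x % 4), so overlapping columns never move.
    """
    regs = [None] * 16
    tiles, written = [], []
    for t in range(num_tiles):
        first_col = 2 * t
        if t == 0:
            new_cols = range(first_col, first_col + 4)      # load the full tile
        else:
            new_cols = range(first_col + 2, first_col + 4)  # only the 2 new columns
        loaded = []
        for r in range(4):
            for x in new_cols:
                idx = 4 * r + x % 4
                regs[idx] = strip[r][x]
                loaded.append(idx)
        written.append(sorted(loaded))
        # Read the tile back in logical column order from the rotated registers.
        tiles.append([[regs[4 * r + (first_col + k) % 4] for k in range(4)]
                      for r in range(4)])
    return tiles, written


strip = [[10 * r + x for x in range(8)] for r in range(4)]
tiles, written = sweep_tiles(strip, 3)
print([len(w) for w in written])  # loads per tile
print(written[1])                 # registers refilled for the second tile
print(written[2])                 # registers refilled for the third tile
```

Under these assumptions, the second tile refills registers 0, 1, 4, 5, 8, 9, 12, 13 and the third refills 2, 3, 6, 7, 10, 11, 14, 15, matching the pattern in the caption. Every tile after the first loads 8 registers instead of 16.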