High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures

Xiang Fu; Xinpeng Zhang; Jixiang Ma; Peng Zhao; Shuai Lu; Xu T. Liu

TL;DR

This work addresses the sensitivity of convolution performance to tensor data layouts on SIMD architectures and the lack of comprehensive benchmarking for im2win and related methods. It introduces three layouts for im2win convolution (NHWC, CHWN, and CHWN8), together with a roofline-guided set of optimizations applicable to both im2win and direct convolutions, and benchmarks them against PyTorch's im2col-based convolution. In the experiments, im2win with the NHWC layout achieves speedups of 11% to 355% over the NCHW layout, and the optimized im2win and direct convolutions reach up to approximately 95% and 94% of the machine's theoretical peak, respectively. The findings underscore the importance of layout-aware optimization for CNN workloads on CPUs and provide practical guidance for deploying fast convolution kernels on SIMD CPUs.

Abstract

Convolution is the core component of deep neural networks, and it is computationally intensive and time-consuming. Tensor data layouts significantly affect the memory access patterns and computational efficiency of convolution operations. Yet there is still no comprehensive performance characterization of data layouts for convolution methods on SIMD architectures. This paper proposes three novel data layouts for im2win convolution: NHWC, CHWN, and CHWN8, and introduces a set of general optimization techniques for both direct and im2win convolutions. We compare the optimized im2win convolution with the direct convolution and PyTorch's im2col-based convolution across the aforementioned layouts on SIMD machines. The experiments demonstrate that the im2win convolution with the new NHWC layout achieves up to a 355% speedup over the NCHW layout. Our optimizations also significantly improve the performance of both im2win and direct convolutions. Our optimized im2win and direct convolutions achieve up to 95% and 94% of the machine's theoretical peak performance, respectively.
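
To make the layout names concrete, the following minimal sketch computes the flat memory offset of element (n, c, h, w) under each layout, where the rightmost letter in a layout name is the fastest-varying dimension. The function names are hypothetical, and the CHWN8 formula is an assumption: it treats CHWN8 as the batch split into blocks of 8, with each block stored in CHWN order, which is inferred from the name rather than stated on this page.

```python
# Flat-offset helpers for a tensor with N images, C channels, H rows, W columns.
# All function names here are illustrative, not taken from the paper.

def offset_nchw(n, c, h, w, N, C, H, W):
    # W varies fastest, then H, then C, then N.
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, N, C, H, W):
    # C varies fastest: all channels of one pixel are contiguous.
    return ((n * H + h) * W + w) * C + c

def offset_chwn(n, c, h, w, N, C, H, W):
    # N varies fastest: the same pixel across the batch is contiguous.
    return ((c * H + h) * W + w) * N + n

def offset_chwn8(n, c, h, w, N, C, H, W, block=8):
    # ASSUMPTION: layout is [N/8][C][H][W][8], i.e. CHWN within each batch block.
    assert N % block == 0, "batch size must be a multiple of the block size"
    nb, ni = divmod(n, block)
    return (((nb * C + c) * H + h) * W + w) * block + ni

if __name__ == "__main__":
    N, C, H, W = 8, 3, 3, 3
    for name, fn in [("NCHW", offset_nchw), ("NHWC", offset_nhwc),
                     ("CHWN", offset_chwn), ("CHWN8", offset_chwn8)]:
        base = fn(0, 0, 0, 0, N, C, H, W)
        channel_stride = fn(0, 1, 0, 0, N, C, H, W) - base
        batch_stride = fn(1, 0, 0, 0, N, C, H, W) - base
        print(name, "channel stride:", channel_stride, "batch stride:", batch_stride)
```

A unit stride along a dimension means that consecutive elements along it can be loaded into one SIMD register with a single contiguous load, which is one reason the choice of layout interacts with vectorization.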

Paper Structure (16 sections, 5 equations, 13 figures, 1 table, 3 algorithms)

Introduction
Preliminary and Related Works
Notation
Tensor Layouts: NCHW, NHWC, CHWN
Convolution Algorithms and Related Works
High-performance Im2win and Direct Convolution using Three Tensor Layouts
Motivations for Different Tensor Layouts
Im2win Tensor Transformation on Three Tensor Layouts
Loop Reordering
Optimizations for the Im2win and Direct Convolutions
Experiments
Experimental Setup
Performance of Different Convolution Algorithms
Conclusion
Peak Performance
...and 1 more section

Figures (13)

Figure 1: The original input tensor ($N=1,H_{original}=W_{original}=C_{original}=3$) and its corresponding im2win tensor ($N=1,C_{im2win}=3,H_{im2win}=2,W_{im2win}=6$) in the NCHW layout
Figure 2: The original input tensor ($N_{i}=1,H_{i}=W_{i}=C_{i}=3$) and its corresponding im2win tensor ($N_{i}=1,C_{i}=3,H_{i}=2,W_{i}=6$) in the NHWC layout, the filter tensor ($N_{f}=1,C_{f}=3,H_{f}=W_{f}=2$), $s=1$, the output tensor ($N_{o}=1,C_{o}=1,H_{o}=W_{o}=2$)
Figure 3: The original tensor ($N_{i}=8,H_{i}=W_{i}=C_{i}=3$) and its corresponding im2win tensor ($N_{i}=8,C_{i}=3,H_{i}=2,W_{i}=6$) in the CHWN/CHWN8 layout
Figure 4: Performance results in TFLOPS of the direct convolution, the im2win convolution and the im2col-based convolution using different layouts. Note that the theoretical peak performance of the server is 3584 GFLOPS (3.584 TFLOPS).
Figure 5: Memory usage of the direct, the im2win and the im2col-based convolutions using different tensor layouts. Note that in conv4, the im2col-based convolutions with the NHWC and NCHW layouts use 21 GB of memory.
...and 8 more figures
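
The shapes in the Figure 1 caption can be reproduced with a small sketch of the im2win transformation for a single image in the NCHW layout. This is an illustration under stated assumptions, not the authors' implementation: it assumes a 2×2 filter with stride 1 (the values given in the Figure 2 caption), assumes that each im2win row stores the covered input rows column by column so that every sliding window is one contiguous slice, and uses hypothetical function names.

```python
def im2win_nchw(x, H_f, stride=1):
    """Transform one image x[C][H][W] into an im2win tensor [C][H_o][W * H_f].

    ASSUMPTION: row i of the im2win tensor holds input rows
    i*stride .. i*stride + H_f - 1, stored column by column.
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    H_o = (H - H_f) // stride + 1
    out = []
    for c in range(C):
        rows = []
        for i in range(H_o):
            row = []
            for w in range(W):          # walk across the input width
                for h in range(H_f):    # stack the H_f covered rows
                    row.append(x[c][i * stride + h][w])
            rows.append(row)
        out.append(rows)
    return out

def conv_from_im2win(win, f, H_f, W_f, stride=1):
    """Convolve using the im2win tensor; f is one filter of shape [C][H_f][W_f]."""
    C, H_o = len(win), len(win[0])
    W = len(win[0][0]) // H_f
    W_o = (W - W_f) // stride + 1
    out = [[0] * W_o for _ in range(H_o)]
    for i in range(H_o):
        for j in range(W_o):
            acc = 0
            for c in range(C):
                start = j * stride * H_f
                window = win[c][i][start:start + W_f * H_f]  # contiguous slice
                k = 0
                for v in range(W_f):
                    for u in range(H_f):
                        acc += window[k] * f[c][u][v]
                        k += 1
            out[i][j] = acc
    return out

if __name__ == "__main__":
    # 3 channels of 3x3, as in Figure 1; values are just running indices.
    x = [[[c * 9 + h * 3 + w for w in range(3)] for h in range(3)] for c in range(3)]
    win = im2win_nchw(x, H_f=2)
    print(len(win), len(win[0]), len(win[0][0]))  # C, H_im2win, W_im2win
    ones = [[[1, 1], [1, 1]] for _ in range(3)]
    print(conv_from_im2win(win, ones, H_f=2, W_f=2))
```

With these inputs the im2win tensor has 3 channels, 2 rows and 6 columns, matching $C_{im2win}=3,H_{im2win}=2,W_{im2win}=6$ in the caption. Adjacent windows in the same row overlap in the im2win tensor and share storage, rather than being copied once per window.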
