High Performance Im2win and Direct Convolutions using Three Tensor Layouts on SIMD Architectures
Xiang Fu, Xinpeng Zhang, Jixiang Ma, Peng Zhao, Shuai Lu, Xu T. Liu
TL;DR
This work addresses the sensitivity of convolution performance to tensor data layouts on SIMD architectures and the lack of comprehensive benchmarking for im2win and related methods. It introduces three layouts for im2win convolution (NHWC, CHWN, and CHWN8), along with a set of roofline-guided optimizations applicable to both im2win and direct convolutions, and benchmarks them against PyTorch's im2col-based convolution. In the experiments, im2win with the NHWC layout achieves speedups ranging from 11% to 355% over the NCHW layout, and the optimized im2win and direct convolutions reach up to approximately 95% and 94% of the machine's theoretical peak performance, respectively. The findings underscore the importance of layout-aware optimization for CNN workloads on CPUs and provide practical guidance for deploying fast convolution kernels on SIMD CPUs.
Abstract
Convolution is the core component of deep neural networks, and it is computationally intensive and time-consuming. Tensor data layouts significantly affect the memory access and computational efficiency of convolution operations. Yet there is still no comprehensive performance characterization of data layouts for convolution methods on SIMD architectures. This paper proposes three novel data layouts for im2win convolution (NHWC, CHWN, and CHWN8) and introduces a set of general optimization techniques for both direct and im2win convolutions. We compare the optimized im2win convolution with direct convolution and PyTorch's im2col-based convolution across these layouts on SIMD machines. The experiments demonstrate that im2win convolution with the new NHWC layout achieves up to a 355% speedup over the NCHW layout. Our optimizations also significantly improve the performance of both im2win and direct convolutions. Our optimized im2win and direct convolutions achieve up to 95% and 94% of the machine's theoretical peak performance, respectively.
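To make the layout names concrete, the following minimal sketch shows how a logical element (n, c, h, w) of a tensor with N images, C channels, height H, and width W could map to a linear memory offset under each layout. The function names are hypothetical and chosen for illustration. The CHWN8 mapping in particular is an assumption: it treats CHWN8 as a blocked variant of CHWN in which the batch is split into groups of 8 that are stored contiguously, which may differ from the exact definition used in the paper.

```python
# Illustrative offset functions for four tensor layouts.
# The rightmost letter in a layout name is the fastest-varying dimension.

def offset_nchw(n, c, h, w, N, C, H, W):
    """NCHW: width is contiguous in memory."""
    return ((n * C + c) * H + h) * W + w

def offset_nhwc(n, c, h, w, N, C, H, W):
    """NHWC: channels are contiguous in memory."""
    return ((n * H + h) * W + w) * C + c

def offset_chwn(n, c, h, w, N, C, H, W):
    """CHWN: the batch dimension is contiguous in memory."""
    return ((c * H + h) * W + w) * N + n

def offset_chwn8(n, c, h, w, N, C, H, W, block=8):
    """CHWN8 (assumed definition): the batch is split into blocks of 8;
    each block is stored in CHW order with its 8 batch entries contiguous."""
    assert N % block == 0, "this sketch assumes N is a multiple of the block size"
    n_outer, n_inner = divmod(n, block)
    return (((n_outer * C + c) * H + h) * W + w) * block + n_inner

if __name__ == "__main__":
    N, C, H, W = 16, 3, 4, 5
    layouts = (offset_nchw, offset_nhwc, offset_chwn, offset_chwn8)
    for f in layouts:
        base = f(0, 0, 0, 0, N, C, H, W)
        # Distance in memory between neighbours along each logical dimension.
        strides = {
            "n": f(1, 0, 0, 0, N, C, H, W) - base,
            "c": f(0, 1, 0, 0, N, C, H, W) - base,
            "h": f(0, 0, 1, 0, N, C, H, W) - base,
            "w": f(0, 0, 0, 1, N, C, H, W) - base,
        }
        print(f.__name__, strides)
```

The dimension with stride 1 is the one a SIMD load can fetch as a contiguous vector, which is one reason the choice of layout interacts with vectorization and memory access patterns.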
