FTuner: A Fast Dynamic Shape Tensors Program Auto-Tuner for Deep Learning Compilers

Pengyu Mu, Linquan Wei, Yi Liu, Rui Wang

TL;DR

FTuner handles dynamic-shape tensors by replacing expensive auto-tuning searches with a hardware-aligned abstract compute unit, the uKernel, from which programs for variable-shaped inputs are composed. It precomputes a set of uKernels through hardware-aware compilation (hardware alignment, parallelism constraints, multi-axis analysis) and, at runtime, assembles programs from these units, using synthesis index analysis (SIA) to minimize padding. The approach delivers end-to-end performance comparable to vendor libraries while cutting tuning time by two orders of magnitude and avoiding the training cost typical of cost-model-based auto-tuners. FTuner is portable across GPU architectures and shows substantial padding reductions and SM-utilization improvements, enabling faster deployment of dynamic-shape models.

Abstract

Many artificial intelligence models process input data of different lengths and resolutions, making the shape of the tensors dynamic. The performance of these models depends on the shape of the tensors, which makes it difficult to optimize the tensor programs before the model runs. There are two common solutions to this problem. The first is to pad the input with useless data so that it matches a pre-optimized tensor library. The second is to use small basic tensors to build a tensor that is closest in size to the input data and then tune it to minimize padding. However, this second solution can be time-consuming. This paper proposes a new technique for deep learning compilers called FTuner. Instead of using a large design space or training a cost model, we use an abstract computational unit called the uKernel to patch together small tensors of various sizes to match the shape of the input tensor. We determine the shape of the uKernel using an analytic hardware information model. Experiments show that FTuner achieves operator and end-to-end performance comparable to vendor libraries and a 3% speedup over an existing auto-tuner built on a model-training compiler, while reducing tuning time by two orders of magnitude.
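
The padding trade-off described in the abstract can be illustrated with a small sketch. It is not FTuner's algorithm (the paper's uKernel selection and SIA are not reproduced here). It is a brute-force toy that assumes a hypothetical set of tile sizes along one axis and compares covering that axis with a single tile size against covering it with two tile sizes. The names `TILE_SIZES`, `single_tile_cover`, and `two_tile_cover`, and the axis length 104, are all assumptions made for illustration.

```python
from math import ceil

# Hypothetical tile sizes along one axis (assumed for illustration only,
# not taken from the paper).
TILE_SIZES = (8, 16, 32, 64)


def single_tile_cover(n, tile):
    """Cover an axis of length n with copies of one tile size.

    Returns (padding, tile_count).
    """
    count = ceil(n / tile)
    return count * tile - n, count


def two_tile_cover(n, tiles=TILE_SIZES):
    """Cover an axis of length n with at most two tile sizes.

    Exhaustively tries `count_a` tiles of size `a`, then as many tiles of
    size `b <= a` as the remainder needs. Picks the candidate with the
    least padding, breaking ties by the fewest tiles.
    Returns (padding, tile_count, a, count_a, b, count_b).
    """
    best = None
    for a in tiles:
        for b in tiles:
            if b > a:
                continue
            for count_a in range(n // a + 1):
                rest = n - count_a * a
                count_b = ceil(rest / b)
                padding = count_a * a + count_b * b - n
                candidate = (padding, count_a + count_b, a, count_a, b, count_b)
                if best is None or candidate < best:
                    best = candidate
    return best


if __name__ == "__main__":
    n = 104  # e.g. a sequence length that is only known at run time
    for tile in TILE_SIZES:
        padding, count = single_tile_cover(n, tile)
        print(f"single tile {tile:>2}: padding={padding:>2}, tiles={count}")
    print("two tile sizes:", two_tile_cover(n))
```

Under these assumptions, a single 64-wide tile pads the axis by 24 elements, a single 8-wide tile needs no padding but 13 small tiles, and three 32-wide tiles plus one 8-wide tile cover the axis exactly with 4 tiles. A real tuner would also weigh per-tile efficiency and hardware occupancy, which this sketch ignores.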

Paper Structure

This paper contains 19 sections, 7 equations, 16 figures, 6 tables, and 2 algorithms.

Figures (16)

  • Figure 1: Tensor shape diversity and padding cost. (a) The shape of input tensors varies across different datasets from the standard NLP benchmark GLUE. (b) As the batch size increases, the amount of useless padding in batch matrix multiplication grows.
  • Figure 2: Illustration of Padding for Dynamic Shapes. Padding the dynamic shape to match the optimized kernel in the manual library.
  • Figure 3: The execution time breakdown of tensors optimized by different compilers. We used NCU to estimate the computation and memory access times for two shapes of the Dense operator on V100, normalized to Vendor. Roller has a higher proportion of padding. Since the padding time of Vendor cannot be measured directly, we estimate it roughly as the difference in computation time between the current kernel and the strictly aligned kernel.
  • Figure 4: Different tilings for matrix multiplication. (a) shows a matrix multiplication output dimension with a prime-sized axis, (b) is composed of a single kernel, (c) is the method adopted in this paper, which achieves zero padding along the j-axis using two kernels, and (d) represents an ideal combined state that is difficult to achieve.
  • Figure 5: The overall architecture of FTuner. Taking the Matmul operator as an example, we assume that the input shape is denoted by the symbol T along the i-axis, while the j and k axes remain fixed. We replace the portion of DietCode that runs from tensor splitting to the generation of optimized tensor programs.
  • ...and 11 more figures