Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent

Michael Canesche, Gaurav Verma, Fernando Magno Quintao Pereira

TL;DR

The paper tackles kernel scheduling in tensor compilers by combining broad exploration with focused exploitation. It introduces DPAnsor, which uses Ansor to explore kernel sketches and then applies Droplet Search within the best discovered space to refine a kernel efficiently. Empirical results across four architectures and 20 ONNX models show that DPAnsor delivers faster end-to-end models and reduces search time. A patch with this combined approach was approved in TVM's Ansor in February 2024, and a similar patch was submitted to TVM's MetaSchedule in June 2024. This hardware-aware autotuning approach improves kernel quality while trimming search overhead, which suggests it may carry over to other autotuners and to real-world deployment scenarios.

Abstract

Machine-learning models consist of kernels, which are algorithms applying operations on tensors -- data indexed by a linear combination of natural numbers. Examples of kernels include convolutions, transpositions, and vectorial products. There are many ways to implement a kernel. These implementations form the kernel's optimization space. Kernel scheduling is the problem of finding the best implementation, given an objective function -- typically execution speed. Kernel optimizers such as Ansor, Halide, and AutoTVM solve this problem via search heuristics, which combine two phases: exploration and exploitation. The first step evaluates many different kernel optimization spaces. The latter tries to improve the best implementations by investigating a kernel within the same space. For example, Ansor combines kernel generation through sketches for exploration and leverages an evolutionary algorithm to exploit the best sketches. In this work, we demonstrate the potential to reduce Ansor's search time while enhancing kernel quality by incorporating Droplet Search, an AutoTVM algorithm, into Ansor's exploration phase. The approach involves limiting the number of samples explored by Ansor, selecting the best, and exploiting it with a coordinate descent algorithm. By applying this approach to the first 300 kernels that Ansor generates, we usually obtain better kernels in less time than if we let Ansor analyze 10,000 kernels. This result has been replicated in 20 well-known deep-learning models (AlexNet, ResNet, VGG, DenseNet, etc.) running on four architectures: an AMD Ryzen 7 (x86), an NVIDIA A100 tensor core, an NVIDIA RTX 3080 GPU, and an ARM A64FX. A patch with this combined approach was approved in Ansor in February 2024. As evidence of the generality of this search methodology, a similar patch, achieving equally good results, was submitted to TVM's MetaSchedule in June 2024.
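The two-phase idea described in the abstract (sample a limited number of candidates, pick the best, then refine it with coordinate descent) can be illustrated with a small toy sketch. This is not the Droplet Search implementation in TVM and it does not call any TVM API: the parameter names `tile_x` and `tile_y`, the `toy_cost` function, and the helpers `explore` and `coordinate_descent` are all hypothetical stand-ins chosen for illustration, and the cost function replaces what would be a measured kernel running time.

```python
import random
from typing import Callable, Dict, List

Config = Dict[str, int]
Space = Dict[str, List[int]]


def explore(space: Space, cost: Callable[[Config], float],
            samples: int, rng: random.Random) -> Config:
    """Exploration: draw a few random configurations and keep the cheapest."""
    candidates = [
        {name: rng.choice(values) for name, values in space.items()}
        for _ in range(samples)
    ]
    return min(candidates, key=cost)


def coordinate_descent(space: Space, cost: Callable[[Config], float],
                       start: Config) -> Config:
    """Exploitation: change one parameter at a time to an adjacent value,
    accepting the move only if the cost strictly decreases."""
    best, best_cost = dict(start), cost(start)
    improved = True
    while improved:
        improved = False
        for name, values in space.items():
            idx = values.index(best[name])
            for j in (idx - 1, idx + 1):
                if not 0 <= j < len(values):
                    continue  # neighbour falls outside the parameter range
                trial = {**best, name: values[j]}
                trial_cost = cost(trial)
                if trial_cost < best_cost:
                    best, best_cost, improved = trial, trial_cost, True
                    break  # re-centre on the new value in the next sweep
    return best


def toy_cost(cfg: Config) -> float:
    """Hypothetical stand-in for a measured kernel running time."""
    return abs(cfg["tile_x"] - 16) + abs(cfg["tile_y"] - 8) + 1.0


# Hypothetical search space: two tiling factors restricted to powers of two.
space: Space = {"tile_x": [1, 2, 4, 8, 16, 32, 64],
                "tile_y": [1, 2, 4, 8, 16, 32, 64]}

seed = explore(space, toy_cost, samples=5, rng=random.Random(0))
tuned = coordinate_descent(space, toy_cost, seed)
print("seed:", seed, "cost:", toy_cost(seed))
print("tuned:", tuned, "cost:", toy_cost(tuned))
```

Because the toy cost has a single minimum along each parameter, the descent reaches it from any seed. Real kernel search spaces are not this well behaved, which is why the quality of the seed found during exploration matters.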

Paper Structure (37 sections, 19 figures)

Figures (19)

  • Figure 1: (a) Abstract view of a kernel. (b) Naïve implementation of the abstract kernel. (c) Two optimization sketches for the naïve kernel. (d) Different annotations for the sketches.
  • Figure 2: (a) A three-dimensional view of the optimization space formed by the parameters P1 and P2 seen in Figure 1 (c-i). (b) A three-dimensional view of the optimization space of parameters P9 and PC.
  • Figure 3: Sketch generation rules. Ansor uses these rules to change the kernel search space. Each rule modifies a sketch, e.g., fusing, splitting or tiling loops. However, these rules do not change the annotations in the sketch.
  • Figure 4: (a) Sketch of the abstract kernel seen in Figure 1 (a). (b) Sketch that ensues from the application of the "Always inlining" rule. (c) Sketch that ensues from the initialization of the unrolling factor.
  • Figure 5: Rules that Ansor uses to create an initial population of kernels, which will be the starting point to the evolutionary search.
  • ...and 14 more figures

Theorems & Definitions (1)

  • Definition 1: The Kernel Search Space