Optimal Kernel Orchestration for Tensor Programs with Korch

Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai

TL;DR

Korch reframes tensor-program optimization by decomposing DNN operators into fine-grained tensor primitives (operator fission) and then solving kernel orchestration as a binary linear programming problem to minimize end-to-end latency on GPUs. By generating a primitive graph and exhaustively profiling candidate kernels, Korch enables cross-operator optimizations and even allows repeated execution of primitives to reduce kernel-launch and memory overhead. Empirical results show up to 1.7× speedups on V100 and 1.6× on A100 across diverse CNN and vision-transformer workloads, with notable gains from case studies such as EfficientViT attention blocks. The approach offers a principled, solver-based path to kernel-level optimization that complements existing graph- and rule-based fusion methods, and it is publicly available for reproduction and extension.
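To make operator fission concrete, the sketch below splits a one-row Softmax into four primitive steps (elementwise exp, reduce-sum, broadcast, elementwise div) and checks that the result matches a monolithic Softmax. The helper names and this particular four-way split are illustrative assumptions, not Korch's actual primitive set or API.

```python
import math

# Hypothetical primitives (names are illustrative, not Korch's API).
def exp_elementwise(row):
    return [math.exp(v) for v in row]

def reduce_sum(row):
    return sum(row)

def broadcast(scalar, length):
    return [scalar] * length

def div_elementwise(numer, denom):
    return [p / q for p, q in zip(numer, denom)]

def softmax_monolithic(row):
    """Softmax written as a single opaque operator."""
    total = sum(math.exp(v) for v in row)
    return [math.exp(v) / total for v in row]

def softmax_fissioned(row):
    """The same computation expressed as a small graph of primitives."""
    e = exp_elementwise(row)    # elementwise primitive
    s = reduce_sum(e)           # reduce primitive
    b = broadcast(s, len(row))  # broadcast primitive
    return div_elementwise(e, b)

row = [1.0, 2.0, 3.0]
fused = softmax_monolithic(row)
split = softmax_fissioned(row)
print(all(math.isclose(a, b) for a, b in zip(fused, split)))
```

Once the operator is expressed this way, each primitive becomes a separate node that an optimizer can regroup with primitives from neighboring operators, which is the opportunity the TL;DR describes.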

Abstract

Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch.
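As a rough illustration of treating orchestration as a binary selection problem, the sketch below assigns one 0/1 variable to each candidate kernel and picks the cheapest feasible selection by brute force. Everything in it is an assumption made for illustration: the toy Softmax-like graph, the candidate kernels, the latencies (made-up integers), and the simplified feasibility rule. Korch itself hands its formulation to an off-the-shelf binary linear programming solver, and its exact constraints are not reproduced here.

```python
from itertools import product

GRAPH_INPUTS = {"x"}
GRAPH_OUTPUTS = {"y"}

# Hypothetical candidate kernels for e = exp(x), s = sum(e), y = e / s.
# name: (tensors read, tensors written, made-up latency in microseconds)
KERNELS = {
    "exp":     ({"x"},      {"e"}, 10),
    "sum":     ({"e"},      {"s"}, 12),
    "div":     ({"e", "s"}, {"y"}, 10),
    "exp_sum": ({"x"},      {"s"}, 15),  # fused; e is never written out
    "exp_div": ({"x", "s"}, {"y"}, 11),  # recomputes exp instead of reading e
    "all":     ({"x"},      {"y"}, 29),  # single fully fused kernel
}

def feasible(selected):
    """Simplified rule: the graph outputs are produced, and every tensor a
    selected kernel reads is a graph input or written by a selected kernel."""
    produced = set(GRAPH_INPUTS)
    for name in selected:
        produced |= KERNELS[name][1]
    if not GRAPH_OUTPUTS <= produced:
        return False
    return all(KERNELS[name][0] <= produced for name in selected)

def solve():
    names = list(KERNELS)
    best_cost, best_set = None, None
    # Enumerate every 0/1 assignment; a real solver avoids this blow-up.
    for bits in product((0, 1), repeat=len(names)):
        selected = [n for n, b in zip(names, bits) if b]
        if not feasible(selected):
            continue
        cost = sum(KERNELS[n][2] for n in selected)
        if best_cost is None or cost < best_cost:
            best_cost, best_set = cost, set(selected)
    return best_cost, best_set

best_cost, best_set = solve()
print(best_cost, sorted(best_set))
```

With these made-up numbers the cheapest plan runs `exp_sum` followed by `exp_div`, so the exp primitive executes twice. That is the kind of redundant-computation strategy a greedy fusion pass would not consider. The feasibility check here ignores cyclic dependencies between kernels, which a full formulation would have to rule out.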

Paper Structure

This paper contains 53 sections, 1 theorem, 9 equations, 13 figures, 2 tables, 1 algorithm.

Key Result

Theorem 1

A set of nodes $\mathcal{P}' \subseteq \mathcal{P}$ forms a convex subgraph of $\mathcal{G} = (\mathcal{P}, \mathcal{E})$ if and only if $\mathcal{P}'$ is the difference of two execution states $\mathcal{P}_1$ and $\mathcal{P}_2$: $\mathcal{P}' = \mathcal{P}_1 \setminus \mathcal{P}_2$.
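Theorem 1 can be checked by brute force on a small graph. The sketch below does so for a four-node diamond DAG. It assumes the usual readings of the two definitions, since neither is spelled out in this summary: a node set is convex if no path between two of its nodes leaves the set, and an execution state is a node set closed under predecessors. The graph and all function names are illustrative.

```python
from itertools import combinations

# Toy diamond DAG: a -> b, a -> c, b -> d, c -> d (illustrative only).
NODES = frozenset("abcd")
EDGES = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}

def all_subsets(nodes):
    items = sorted(nodes)
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)

def descendants(node):
    """Nodes reachable from `node` through one or more edges."""
    seen, stack = set(), [node]
    while stack:
        u = stack.pop()
        for src, dst in EDGES:
            if src == u and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen

REACH = {n: descendants(n) for n in NODES}

def is_convex(sub):
    """Assumed definition: no outside node lies on a path between two
    nodes of `sub`."""
    for w in NODES - sub:
        entered = any(w in REACH[u] for u in sub)      # some u in sub reaches w
        returned = any(v in sub for v in REACH[w])     # w reaches some v in sub
        if entered and returned:
            return False
    return True

def is_execution_state(sub):
    """Assumed definition: if a node is in `sub`, so are its predecessors."""
    return all(src in sub for src, dst in EDGES if dst in sub)

states = [s for s in all_subsets(NODES) if is_execution_state(s)]
convex_sets = {s for s in all_subsets(NODES) if is_convex(s)}
differences = {p1 - p2 for p1 in states for p2 in states}

print(convex_sets == differences, len(states), len(convex_sets))
```

On this graph the six execution states yield exactly the 13 convex node sets as pairwise differences. The three excluded subsets are those that contain both `a` and `d` while omitting `b` or `c`. A check on one graph is only a sanity test of the statement, not a proof.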

Figures (13)

  • Figure 1: An overview of Korch.
  • Figure 2: Operator fission enables subsequent optimizing transformations on primitive graphs. In the primitive-graph panel (fig:primitive_graph), dotted boxes of the same color indicate a single transformation on the primitive graph. Together, the three transformations fuse the reduce primitive in Softmax and the subsequent MatMul into a single MatMul.
  • Figure 3: The operator fission rule for Softmax.
  • Figure 4: An example of kernel orchestration for a subgraph of self-attention (Vaswani et al., 2017). The shadowed subgraph in the kernel-mapping panel (fig:primitive_graph_kernel_mapping) is similar to the primitive graph transformation result in fig:fission.
  • Figure 5: Comparing memory bandwidth and floating-point throughput across GPU generations. The y-axis is normalized to the performance of P100 GPUs.
  • ...and 8 more figures

Theorems & Definitions (5)

  • Definition 1: Convex subgraph
  • Definition 2: Execution state
  • Theorem 1
  • Proof of Theorem 1
  • Definition 3: Possible output set