Gensor: A Graph-based Construction Tensor Compilation Method for Deep Learning

Hangda Liu, Boyu Diao, Yu Yang, Wenxin Chen, Xiaohui Peng, Yongjun Xu

TL;DR

The paper presents Gensor, a graph-based construction tensor compilation method that replaces traditional tree-based optimization with a graph traversal informed by Markov analysis. By representing tensor programs as ETIR states and scheduling primitives as edges, Gensor expands the optimization space and enables principled, hardware-aware exploration, including virtual thread scheduling. Empirical results on GPUs show consistent performance gains (18% on average, up to 30%) over state-of-the-art constructive methods and results competitive with learning-based search approaches, while compiling much faster than search-based methods. Gensor improves both individual operators and end-to-end models and remains robust for dynamic input shapes, suggesting practical value for cloud and edge deployments. Overall, Gensor offers a scalable, agile framework that balances compilation speed and kernel performance for modern deep learning workloads.

Abstract

High-performance deep learning depends on efficient tensor programs. In recent years, automatic tensor program optimization, also known as tensor compilation, has emerged as the primary approach to generating efficient tensor programs. However, generating higher-performance kernels in less time remains the key challenge. In this paper, we present Gensor, a graph-based construction tensor compilation method for deep learning, to further improve the performance of construction tensor compilation. Unlike existing tree-based methods, Gensor abstracts the construction space into a graph structure and explores it with Markov analysis. Gensor takes tensor programs as states and models scheduling primitives as transition actions between these states, so the process of tensor program construction optimization is abstracted as a graph traversal. This approach expands the optimization space, improving operator performance while keeping optimization fast. Extensive experiments with typical operators demonstrate that Gensor significantly outperforms state-of-the-art methods on GPUs for both cloud servers and edge devices. Gensor can generate operator kernels in seconds, with performance increasing by 18% on average and by up to 30%. It also achieves high speedup for end-to-end models such as ResNet-50 and GPT-2, with an average acceleration of 20%.
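The idea of treating tensor programs as states and scheduling primitives as transitions can be illustrated with a toy random walk. The sketch below is not Gensor's algorithm: the state (a pair of tile sizes), the `neighbors` primitives (double or halve one tile size), the `toy_score` benefit function, and the capacity of 1024 elements are all hypothetical stand-ins chosen only to show how a graph, unlike a tree, lets the traversal move back from a poor choice.

```python
import random

def neighbors(state, limit=64):
    """Edges of the toy graph: double or halve one tile size (hypothetical primitives)."""
    result = []
    for axis in range(len(state)):
        for size in (state[axis] * 2, state[axis] // 2):
            if 1 <= size <= limit:
                nxt = list(state)
                nxt[axis] = size
                result.append(tuple(nxt))
    return result

def toy_score(state, capacity=1024):
    """Made-up benefit: data reuse of an m x n tile, zero if it exceeds a made-up capacity."""
    m, n = state
    if m * n > capacity:
        return 0.0
    return (m * n) / (m + n)

def markov_walk(start, steps=200, seed=0):
    """Random walk whose transition probabilities are proportional to the toy score."""
    rng = random.Random(seed)
    current = best = start
    for _ in range(steps):
        options = neighbors(current)
        # A small constant keeps every edge reachable, including zero-score states.
        weights = [toy_score(s) + 1e-6 for s in options]
        current = rng.choices(options, weights=weights, k=1)[0]
        if toy_score(current) > toy_score(best):
            best = current
    return best

if __name__ == "__main__":
    best = markov_walk((1, 1))
    print("best tile found:", best, "score:", toy_score(best))
```

Because every doubling edge has a matching halving edge, the walk can undo an earlier enlargement, which a unidirectional tree construction cannot do. A real system would replace `toy_score` with a hardware-aware estimate.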

Paper Structure

This paper contains 16 sections, 6 equations, 12 figures, 6 tables, 2 algorithms.

Figures (12)

  • Figure 1: Unidirectional tree structure with one single objective using Roller. The red arrow indicates the solution identified by Roller. The green arrow indicates a solution with higher FLOPS (floating point operations per second), representing higher GPU throughput, namely better performance. The performance difference between the two solutions is 9%.
  • Figure 2: Overview of Gensor.
  • Figure 3: Diagram of virtual threads in ETIR.
  • Figure 4: Illustration of Gensor. The blue blocks represent the nodes, namely the possible tensor programs in the construction graph. The green arrows represent the edges, namely the possible scheduling primitives in the construction graph. The Memory Level denotes the position of a cache level in the target hardware's memory hierarchy. A higher level means the memory is closer to the computing units.
  • Figure 5: Illustration of Actions. Each color represents a tile corresponding to the elements that a thread needs to compute.
  • ...and 7 more figures