Table of Contents
Fetching ...

Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization

Yangjie Zhou, Honglin Zhu, Qian Qiu, Weihao Cui, Zihan Liu, Cong Guo, Siyuan Feng, Jintao Meng, Haidong Lan, Jingwen Leng, Wenxi Zhu, Minwen Deng

TL;DR

Vortex is a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs that features a unique bidirectional compilation workflow, combining top-down abstraction for aligning tensor program execution with hardware hierarchies and bottom-up kernel construction to narrow the search space, enabling Vortex to achieve remarkable efficiency.

Abstract

Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting attention for their ability to handle variable input sizes in real-time applications. However, existing compilation optimization methods for such networks often rely heavily on predefined samples to guide the compilation process, which restricts their adaptability and efficiency. These sample-driven methods struggle to efficiently manage the diverse and unpredictable shapes encountered in real-world scenarios, often resulting in suboptimal performance. To tackle these issues, we introduce Vortex, a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs. Vortex capitalizes on detailed hardware information and hierarchizes the strategy space to facilitate high-performance code generation without relying on runtime shape samples. It features a unique bidirectional compilation workflow, combining top-down abstraction for aligning tensor program execution with hardware hierarchies and bottom-up kernel construction to narrow the search space, enabling Vortex to achieve remarkable efficiency. Comprehensive evaluations confirm that Vortex reduces compilation time by $176\times$ compared to the existing dynamic-shape compiler. Additionally, it substantially outperforms existing vendor-provided libraries and dynamic-shape compilers on both CPU and GPU platforms, delivering speedups of $2.53\times$ and $3.01\times$, respectively.

Vortex: Efficient Sample-Free Dynamic Tensor Program Optimization via Hardware-aware Strategy Space Hierarchization

TL;DR

Vortex is a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs that features a unique bidirectional compilation workflow, combining top-down abstraction for aligning tensor program execution with hardware hierarchies and bottom-up kernel construction to narrow the search space, enabling Vortex to achieve remarkable efficiency.

Abstract

Dynamic-shape deep neural networks (DNNs) are rapidly evolving, attracting attention for their ability to handle variable input sizes in real-time applications. However, existing compilation optimization methods for such networks often rely heavily on predefined samples to guide the compilation process, which restricts their adaptability and efficiency. These sample-driven methods struggle to efficiently manage the diverse and unpredictable shapes encountered in real-world scenarios, often resulting in suboptimal performance. To tackle these issues, we introduce Vortex, a hardware-driven and sample-free compiler tailored for dynamic-shape tensor programs. Vortex capitalizes on detailed hardware information and hierarchizes the strategy space to facilitate high-performance code generation without relying on runtime shape samples. It features a unique bidirectional compilation workflow, combining top-down abstraction for aligning tensor program execution with hardware hierarchies and bottom-up kernel construction to narrow the search space, enabling Vortex to achieve remarkable efficiency. Comprehensive evaluations confirm that Vortex reduces compilation time by compared to the existing dynamic-shape compiler. Additionally, it substantially outperforms existing vendor-provided libraries and dynamic-shape compilers on both CPU and GPU platforms, delivering speedups of and , respectively.
Paper Structure (37 sections, 4 equations, 16 figures, 7 tables, 2 algorithms)

This paper contains 37 sections, 4 equations, 16 figures, 7 tables, 2 algorithms.

Figures (16)

  • Figure 1: Comparison of $Vortex$ with existing methods.
  • Figure 2: Existing sample-driven compilation workflow.
  • Figure 3: Comparing DietCode and cuBLAS over various sequence lengths on A100 GPU. 'DietCode-I' and 'DietCode-O' represent DietCode's dynamic input configurations inside and outside the tuning sample list, respectively.
  • Figure 4: CPU/GPU Diagram.
  • Figure 5: GEMM performance across different hardware resource usages on 8255c CPU and A100 GPU. Legend indicates corresponding GEMM parameters M, N, and K.
  • ...and 11 more figures