
Relax: Composable Abstractions for End-to-End Dynamic Machine Learning

Ruihang Lai, Junru Shao, Siyuan Feng, Steven S. Lyubomirsky, Bohan Hou, Wuwei Lin, Zihao Ye, Hongyi Jin, Yuchen Jin, Jiawei Liu, Lesheng Jin, Yaxing Cai, Ziheng Jiang, Yong Wu, Sunghyun Park, Prakalp Srivastava, Jared G. Roesch, Todd C. Mowry, Tianqi Chen

TL;DR

Relax tackles dynamic shape computations in end-to-end ML workloads by introducing a cross-level compiler abstraction that unifies graphs, tensor programs, and libraries, together with first-class symbolic shape annotations. This enables dynamic-shape-aware analyses and optimizations across boundaries, including memory planning, operator fusion, workspace lifting, and CUDA Graph offloading, within an ahead-of-time compilation framework. The authors implement Relax on top of TVM and demonstrate competitive LLM inference performance across GPUs and emerging devices (mobile, embedded, WebGPU) while expanding deployability to new backends. The work's key contributions are the cross-level abstraction, symbolic shape deduction, and a concrete optimization pipeline that yields measurable memory and latency benefits.

Abstract

Dynamic shape computations have become critical in modern machine learning workloads, especially in emerging large language models. The success of these models has driven the demand for their universal deployment across a diverse set of backend environments. In this paper, we present Relax, a compiler abstraction for optimizing end-to-end dynamic machine learning workloads. Relax introduces a cross-level abstraction that encapsulates computational graphs, loop-level tensor programs, and external library calls in a single representation. Relax also introduces first-class symbolic shape annotations to track dynamic shape computations globally across the program, enabling dynamic shape-aware cross-level optimizations. We build an end-to-end compilation framework using the proposed approach to optimize dynamic shape models. Experimental results on LLMs show that Relax delivers performance competitive with state-of-the-art systems across various GPUs and enables deployment of emerging models to a broader set of emerging environments, including mobile phones, embedded devices, and web browsers.
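To make the idea of first-class symbolic shape annotations concrete, the following is a minimal sketch in plain Python. It is not the Relax or TVM API: the `TensorAnn` class, the `matmul_ann` rule, and the use of strings such as `"n"` for symbolic dimensions are all hypothetical names chosen for illustration. The sketch shows how carrying a named symbolic dimension through an operator lets a compiler conclude that two dynamically shaped tensors have the same shape, which is the kind of fact that dynamic shape-aware optimizations such as memory planning can rely on.

```python
from dataclasses import dataclass
from typing import Tuple, Union

# Hypothetical encoding: an int is a static extent, a str names a symbolic variable.
Dim = Union[int, str]


@dataclass(frozen=True)
class TensorAnn:
    """Illustrative shape annotation attached to a tensor value."""
    shape: Tuple[Dim, ...]
    dtype: str


def matmul_ann(a: TensorAnn, b: TensorAnn) -> TensorAnn:
    """Deduce the annotation of a 2-D matmul from its input annotations."""
    if len(a.shape) != 2 or len(b.shape) != 2:
        raise ValueError("this sketch only handles 2-D operands")
    if a.shape[1] != b.shape[0]:
        raise ValueError(f"inner dimensions differ: {a.shape[1]} vs {b.shape[0]}")
    # The symbolic row count of `a` is propagated unchanged to the result.
    return TensorAnn((a.shape[0], b.shape[1]), a.dtype)


def provably_same_shape(a: TensorAnn, b: TensorAnn) -> bool:
    """True when every dimension matches, symbol by symbol."""
    return a.shape == b.shape


# `n` is unknown at compile time, but it is the same `n` everywhere it appears.
x = TensorAnn(("n", 4), "float32")
w = TensorAnn((4, 8), "float32")
y = matmul_ann(x, w)
z = matmul_ann(x, w)
```

Here `y` and `z` both carry the shape `("n", 8)`, so `provably_same_shape(y, z)` holds even though `n` is only known at run time. With an annotation that merely records "unknown" for the first dimension, that equality could not be established.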

Paper Structure (21 sections, 20 figures, 2 tables, 3 algorithms)

Figures (20)

  • Figure 1: Overview of our approach. We present a cross-level abstraction that encapsulates computational graphs, loop-level tensor programs, and external library functions in a single representation. We introduce first-class symbolic shape annotations to track dynamic shape computations globally across the program and to enable dynamic shape-aware optimizations across levels.
  • Figure 2: Key elements of Relax abstraction.
  • Figure 3: Comparison of first-class symbolic shape annotation with unknown dynamic shape annotation. First-class symbolic shape enables comprehensive symbolic analysis and facilitates advanced dynamic shape-aware optimizations.
  • Figure 4: Cross-level abstractions: graph-level functions call and communicate with loop-level TensorIR functions using call_tir, and invoke library functions via call_dps_library.
  • Figure 5: Explanation of the semantics of call_tir.
  • ...and 15 more figures
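
The captions for Figures 4 and 5 refer to call_tir without showing what it does. The sketch below is one way to picture it in plain Python, under the assumption that call_tir follows a destination-passing convention: the caller allocates the output from a shape known at the graph level, and the loop-level function writes into it. The helper names `alloc_tensor` and `add_one`, and the use of nested lists in place of tensors, are hypothetical and are not part of TVM.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]


def alloc_tensor(shape: Tuple[int, int]) -> Matrix:
    """Allocate a zero-filled 2-D buffer (stand-in for a real tensor allocation)."""
    rows, cols = shape
    return [[0.0] * cols for _ in range(rows)]


def call_tir(prim_func: Callable[..., None],
             args: Sequence[Matrix],
             out_shape: Tuple[int, int]) -> Matrix:
    """Assumed semantics: allocate the output, pass it as the last argument, return it."""
    out = alloc_tensor(out_shape)
    prim_func(*args, out)  # the loop-level function mutates `out` in place
    return out


def add_one(x: Matrix, out: Matrix) -> None:
    """A loop-level 'tensor program' in destination-passing style."""
    for i in range(len(x)):
        for j in range(len(x[i])):
            out[i][j] = x[i][j] + 1.0


# The graph level knows the output shape (n, 2) before the call runs.
n = 3
x = [[float(i * 2 + j) for j in range(2)] for i in range(n)]
y = call_tir(add_one, [x], (n, 2))
```

Because the output shape is supplied by the caller rather than decided inside `add_one`, the graph level sees a pure function that returns a fresh value, while allocation stays visible to the compiler and can be planned ahead of time.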