BladeDISC++: Memory Optimizations Based On Symbolic Shape

Xiulong Yuan, Xu Yan, Wenting Shen, Xiafei Qiu, Ang Wang, Jie Zhang, Yong Li, Wei Lin

TL;DR

BladeDISC++ tackles memory optimization for dynamic shape graphs where exact tensor shapes are unavailable, addressing the challenge with symbolic shapes. It builds a global symbolic shape graph and uses $SymbolicDim$ and $SymbolicExpr$ to compare memory footprints of different op sequences and candidate rematerialization subgraphs. The approach performs op scheduling and rematerialization under a compilation-runtime hybrid strategy, inserting $EvictOp$ and $Remat::RegenerateOps$ at compile time and making final decisions at runtime. Evaluations on Llama-2-1b with CodeAlpaca-20K show meaningful memory reductions for dynamic shapes, achieving memory usage close to static-shape training while improving end-to-end performance and reducing recompilation overhead.

Abstract

Recent deep learning workloads exhibit dynamic characteristics, leading to the rising adoption of dynamic shape compilers. These compilers can generate efficient kernels for dynamic shape graphs, which are characterized by a fixed graph topology and uncertain tensor shapes. However, memory optimization, although particularly crucial in this large model era, remains relatively underexplored for dynamic shape graphs. The fundamental challenge lies in the lack of precise tensor shapes, which are essential in conventional methods such as operation scheduling (op scheduling) and rematerialization. To address this challenge, we propose op scheduling and rematerialization approaches based on symbolic shapes and implement them in BladeDISC++. In addition, since rematerialization decisions cannot be made solely at compile time when tensor shapes are unknown, BladeDISC++ employs a combined compilation-runtime strategy to handle shape dynamics. Evaluations indicate that BladeDISC++ effectively reduces memory usage for dynamic shape graphs, achieving memory consumption comparable to optimizations using precise shapes, thereby promoting the broader adoption of dynamic shape compilers.
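To make the idea of comparing memory footprints without concrete shapes more tangible, here is a minimal Python sketch. It is not BladeDISC++ code: the `SymExpr` class, the `compare` helper, and the tensor shapes are all hypothetical and chosen only for illustration. The sketch assumes every symbolic dimension is a positive integer, so a difference whose coefficients all share one sign can be ordered at compile time, while a mixed-sign difference has to wait for runtime values.

```python
class SymExpr:
    """Polynomial with integer coefficients over positive symbolic dims (illustrative only)."""

    def __init__(self, terms=None):
        # terms maps a monomial (sorted tuple of dim names) to its coefficient
        self.terms = {m: c for m, c in (terms or {}).items() if c != 0}

    @staticmethod
    def dim(name):
        return SymExpr({(name,): 1})

    @staticmethod
    def const(value):
        return SymExpr({(): value})

    def __add__(self, other):
        out = dict(self.terms)
        for m, c in other.terms.items():
            out[m] = out.get(m, 0) + c
        return SymExpr(out)

    def __sub__(self, other):
        return self + SymExpr({m: -c for m, c in other.terms.items()})

    def __mul__(self, other):
        out = {}
        for m1, c1 in self.terms.items():
            for m2, c2 in other.terms.items():
                m = tuple(sorted(m1 + m2))
                out[m] = out.get(m, 0) + c1 * c2
        return SymExpr(out)

    def evaluate(self, bindings):
        # Substitute concrete dim values, as a runtime would once shapes are known
        total = 0
        for m, c in self.terms.items():
            prod = c
            for name in m:
                prod *= bindings[name]
            total += prod
        return total


def compare(a, b):
    """Order a against b assuming every dim >= 1; 'unknown' means defer to runtime."""
    diff = (a - b).terms
    if not diff:
        return "equal"
    if all(c > 0 for c in diff.values()):
        return "greater"
    if all(c < 0 for c in diff.values()):
        return "less"
    return "unknown"


# Hypothetical fp16 tensors (2 bytes per element) with symbolic batch B and sequence length S
B, S = SymExpr.dim("B"), SymExpr.dim("S")
hidden = SymExpr.const(2 * 4096) * B * S      # shape [B, S, 4096]
ffn = SymExpr.const(2 * 11008) * B * S        # shape [B, S, 11008]
scores = SymExpr.const(2 * 32) * B * S * S    # shape [B, 32, S, S]

print(compare(hidden, ffn))     # less: decidable at compile time
print(compare(scores, hidden))  # unknown: depends on the runtime value of S
print((scores - hidden).evaluate({"B": 1, "S": 64}))
print((scores - hidden).evaluate({"B": 1, "S": 512}))
```

The first comparison can be settled from the symbolic expressions alone. The second cannot, because the sign of the difference flips as `S` grows, which is the kind of case where a decision would have to be postponed until actual shapes are available.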

Paper Structure

This paper contains 7 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Memory optimizations based on symbolic shapes in BladeDISC++
  • Figure 2: OpScheduler algorithm main loop