Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity

Haojun Xia, Zhen Zheng, Yuchao Li, Donglin Zhuang, Zhongzhu Zhou, Xiafei Qiu, Yong Li, Wei Lin, Shuaiwen Leon Song

TL;DR

Flash-LLM tackles the memory bandwidth bottleneck in large generative model inference by enabling unstructured sparsity on Tensor Cores through a Load-as-Sparse, Compute-as-Dense paradigm. It introduces the Tiled-CSL sparse format and a two-level, overlapped computation pipeline that orchestrates sparse data extraction, dense data loading, and Tensor Core computation. At the kernel level it outperforms Sputnik and SparTA by an average of 2.9x and 1.5x, respectively. End to end on OPT-30B/66B/175B models, it improves tokens per GPU-second by up to 3.8x over DeepSpeed and 3.6x over FasterTransformer, and it integrates with FasterTransformer for practical deployment. These results indicate a practical path to cost-effective, scalable LLM inference with moderate unstructured sparsity on contemporary GPUs.

Abstract

With the fast growth of parameter size, it becomes increasingly challenging to deploy large generative models, as they typically require large GPU memory consumption and massive computation. Unstructured model pruning has been a common approach to reduce both GPU memory footprint and overall computation while retaining good model accuracy. However, existing solutions do not provide highly efficient support for handling unstructured sparsity on modern GPUs, especially on the highly structured Tensor Core hardware. Therefore, we propose Flash-LLM for enabling low-cost and highly efficient large generative model inference with sophisticated support of unstructured sparsity on high-performance but highly restrictive Tensor Cores. Based on our key observation that the main bottleneck of generative model inference is the several skinny matrix multiplications, for which Tensor Cores would be significantly under-utilized due to low computational intensity, we propose a general Load-as-Sparse and Compute-as-Dense methodology for unstructured sparse matrix multiplication. The basic insight is to address the significant memory bandwidth bottleneck while tolerating redundant computations that are not critical for end-to-end performance on Tensor Cores. Based on this, we design an effective software framework for Tensor-Core-based unstructured SpMM, leveraging on-chip resources for efficient sparse data extraction and computation/memory-access overlapping. At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA by an average of 2.9x and 1.5x, respectively. At the end-to-end framework level on OPT-30B/66B/175B models, for tokens per GPU-second, Flash-LLM achieves up to 3.8x and 3.6x improvement over DeepSpeed and FasterTransformer, respectively, with significantly lower inference cost.
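
The Load-as-Sparse, Compute-as-Dense idea described in the abstract can be illustrated with a toy sketch in plain Python. This is not the Flash-LLM kernel and not the Tiled-CSL format: the helper names `compress_tile`, `expand_tile`, and `dense_matmul`, and the (value, position) pair layout, are assumptions made purely for illustration. The weight tile is stored as a compact list of nonzeros, scattered into a zero-filled dense tile (standing in for on-chip memory), and then multiplied by an ordinary dense routine that also multiplies the zeros.

```python
def compress_tile(tile):
    """Keep only nonzeros as (value, flat_position) pairs (hypothetical layout)."""
    cols = len(tile[0])
    return [(v, r * cols + c)
            for r, row in enumerate(tile)
            for c, v in enumerate(row) if v != 0]

def expand_tile(nonzeros, rows, cols):
    """Scatter the nonzeros into a zero-initialised dense tile."""
    dense = [[0.0] * cols for _ in range(rows)]
    for v, pos in nonzeros:
        dense[pos // cols][pos % cols] = v
    return dense

def dense_matmul(a, b):
    """Plain dense matmul; it multiplies the zeros too (the tolerated redundancy)."""
    inner, n = len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(inner)) for j in range(n)]
            for i in range(len(a))]

weights = [[0.0, 2.0, 0.0, 0.0],
           [1.0, 0.0, 0.0, 3.0]]       # 2x4 weight tile with 3 nonzeros
acts = [[1.0], [2.0], [3.0], [4.0]]     # skinny 4x1 activation matrix

stored = compress_tile(weights)              # what would be read from global memory
tile = expand_tile(stored, rows=2, cols=4)   # dense tile rebuilt "on chip"
result = dense_matmul(tile, acts)            # dense compute over the rebuilt tile
```

The sketch only moves 3 of the 8 weight entries through the "load" step, yet the multiply still performs all 8 multiply-adds per output column. That trade (less memory traffic, some redundant arithmetic) is the one the abstract argues is worthwhile when the kernel is memory-bound.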

Paper Structure (29 sections, 3 equations, 16 figures, 1 table, 3 algorithms)

This paper contains 29 sections, 3 equations, 16 figures, 1 table, 3 algorithms.

Figures (16)

  • Figure 1: (a) Generative model inference; (b) KV-Cache.
  • Figure 2: Decoder layer architecture. H denotes the hidden dimension (a.k.a. the model dimension), which equals 12K for GPT-3. B denotes the inference batch size, which is typically small for real-time inference, e.g. 8, 16, or 32.
  • Figure 3: Performance of an unstructured SpMM (M/K/N = hidden_size*4/hidden_size/batch_size = 36K/9K/8) under different designs on GPU. SIMT-core-centric designs are shown with dashed lines, while Tensor-Core-centric designs (including our solution, Flash-LLM) are shown with solid lines.
  • Figure 4: GPU utilization breakdown. The MatMuls profiled in this figure are the most time-consuming parts of OPT-66B inference (with 2 GPUs) at batch sizes 16, 32, 64, and 128.
  • Figure 5: Roofline model for skinny MatMuls. The solid squares mark the computational intensity (CI) and the performance upper bound for dense solutions (e.g. cuBLAS), while the solid stars mark the improved CI and the new performance bound with our Load-as-Sparse, Compute-as-Dense approach. Note that the vertical axis uses a logarithmic scale.
  • ...and 11 more figures
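
The roofline argument behind Figure 5 can be made concrete with a back-of-the-envelope calculation, using the SpMM shape quoted in the Figure 3 caption. The sketch below rests on several assumptions that are not taken from the page above: 36K and 9K are read as 36*1024 and 9*1024, all dense operands are FP16 (2 bytes per element), each stored nonzero costs 4 bytes (a 2-byte value plus a 2-byte position), and any other format metadata is ignored. The function name `computational_intensity` is hypothetical.

```python
def computational_intensity(m, k, n, sparsity=0.0, bytes_per_weight=2):
    """FLOPs per byte of memory traffic for an (m x k) @ (k x n) MatMul.

    The FLOP count is always the dense one (compute-as-dense); only the
    weight traffic shrinks with sparsity (load-as-sparse).
    """
    flops = 2 * m * k * n                                  # one multiply + one add per term
    weight_bytes = m * k * (1.0 - sparsity) * bytes_per_weight
    activation_bytes = 2 * (k * n + m * n)                 # FP16 input and output matrices
    return flops / (weight_bytes + activation_bytes)

M, K, N = 36 * 1024, 9 * 1024, 8                           # assumed reading of 36K/9K/8

ci_dense = computational_intensity(M, K, N)
ci_sparse = computational_intensity(M, K, N, sparsity=0.8, bytes_per_weight=4)
gain = ci_sparse / ci_dense
```

Under these assumptions the dense CI is close to the batch size N (just under 8 FLOPs per byte), and at 80% sparsity it rises to roughly 20 FLOPs per byte, about 2.5x higher. In a memory-bound regime the attainable performance scales with CI, so this ratio is a rough upper bound on the kernel speedup rather than a prediction. With 4 bytes per nonzero, the same arithmetic also shows no traffic reduction at 50% sparsity, which is a property of the assumed storage cost, not a claim about the paper's results.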