Deep Kernel Fusion for Transformers

Zixi Zhang, Zhiwen Mo, Yiren Zhao, Robert Mullins

TL;DR

This work tackles memory-bandwidth bottlenecks in agentic LLM inference, where long contexts cause SwiGLU MLP blocks to dominate memory traffic. It proposes DeepFusionKernel, a deeply fused CUDA operator that performs the SwiGLU computation in a single kernel, integrated with SGLang and paired with a profiler-driven scheduler that adapts to workload and hardware. Experiments show consistent throughput gains over SGLang, up to 9.7% on A100 and 13.2% on H100, across batch sizes and long-generation scenarios. The approach offers a deployable way to make better use of GPU compute in bandwidth-bound decoding by minimizing memory traffic, and it holds up across models, configurations, and hardware platforms.

Abstract

Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel ensures consistent accelerations over generation lengths, while remaining adaptable to diverse models, inference configurations, and hardware platforms.
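
To make the fusion idea concrete, the sketch below computes a SwiGLU MLP two ways in plain Python: once as separate steps that materialize every intermediate, and once by streaming over tiles of the hidden dimension so the gated activation is consumed as soon as it is produced. It illustrates the general principle only and is not the paper's CUDA kernel; the function names, the `tile` parameter, and the toy matrices are assumptions made for this example.

```python
import math

def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def swiglu_unfused(x, w_gate, w_up, w_down):
    """Separate steps; gate, up, and hidden are each stored in full."""
    gate = matmul(x, w_gate)                      # [tokens, d_ff]
    up = matmul(x, w_up)                          # [tokens, d_ff]
    hidden = [[silu(g) * u for g, u in zip(gr, ur)]
              for gr, ur in zip(gate, up)]        # [tokens, d_ff]
    return matmul(hidden, w_down)                 # [tokens, d_model]

def swiglu_streamed(x, w_gate, w_up, w_down, tile=2):
    """Walk d_ff tile by tile and accumulate the down projection directly.

    No [tokens, d_ff] intermediate is ever stored; the multiply-add count
    is the same as in swiglu_unfused.
    """
    d_ff, d_model = len(w_down), len(w_down[0])
    out = [[0.0] * d_model for _ in x]
    for start in range(0, d_ff, tile):
        for t, row in enumerate(x):
            for j in range(start, min(start + tile, d_ff)):
                g = sum(xi * w_gate[i][j] for i, xi in enumerate(row))
                u = sum(xi * w_up[i][j] for i, xi in enumerate(row))
                h = silu(g) * u                   # lives only in a local variable
                for k in range(d_model):
                    out[t][k] += h * w_down[j][k]
    return out

def make_matrix(rows, cols, seed):
    """Deterministic toy data (hypothetical values, for illustration only)."""
    return [[((i * cols + j + seed) % 7 - 3) / 4 for j in range(cols)]
            for i in range(rows)]

x = make_matrix(2, 3, 1)        # 2 tokens, d_model = 3
w_gate = make_matrix(3, 4, 2)   # d_model x d_ff
w_up = make_matrix(3, 4, 5)
w_down = make_matrix(4, 3, 3)   # d_ff x d_model

ref = swiglu_unfused(x, w_gate, w_up, w_down)
fused = swiglu_streamed(x, w_gate, w_up, w_down, tile=2)
max_diff = max(abs(a - b) for ra, rb in zip(ref, fused) for a, b in zip(ra, rb))
print("max abs difference:", max_diff)
```

On a GPU, the intermediates in the unfused version would be written to and read back from HBM, whereas the streamed version keeps them in fast on-chip storage. That saved round trip is the kind of traffic a fused kernel aims to remove.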

Paper Structure (14 sections, 2 equations, 2 figures, 2 tables)

Figures (2)

  • Figure 1: DeepFusionKernel leverages aggressive kernel fusion on SwiGLU blocks to eliminate intermediate activations and reduce memory traffic without increasing FLOPs. Panels show implementation layouts: (a) naive PyTorch with four kernel launches; (b) the two-kernel design used by SGLang and vLLM; and (c) our single, deeply fused kernel that streams data through GEMMs and nonlinearities to avoid extra loads/stores. Panel (d) shows the speedup over PyTorch and SGLang/vLLM. By removing these redundant reads/writes, DeepFusionKernel yields up to 9.7% and 13.2% throughput improvements on A100 and H100 GPUs, respectively, on bandwidth-bound autoregressive decoding workloads.
  • Figure 2: A matrix splitting scheme of consecutive matrix multiplications under tensor parallelism (TP) that contains a single all-reduce operation, denoted as $\sum$.
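
The single all-reduce in the Figure 2 caption can be illustrated with a small sketch. One common way to get there, assumed here and not confirmed as the paper's exact scheme, is to split the first weight matrix by columns and the second by rows, so that each rank computes a full-shaped partial result locally and only the final sum needs communication. All names below (`split_columns`, `split_rows`, `tp_two_matmuls`) are hypothetical, and the "ranks" are simulated in a single process.

```python
def matmul(a, b):
    """Plain row-by-column matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split w into `parts` column blocks (assumes the width divides evenly)."""
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def split_rows(w, parts):
    """Split w into `parts` row blocks (assumes the height divides evenly)."""
    step = len(w) // parts
    return [w[p * step:(p + 1) * step] for p in range(parts)]

def tp_two_matmuls(x, w1, w2, ranks):
    """Compute (x @ w1) @ w2 with per-rank shards and one final sum."""
    partials = []
    for w1_shard, w2_shard in zip(split_columns(w1, ranks), split_rows(w2, ranks)):
        h_local = matmul(x, w1_shard)                 # no communication needed
        partials.append(matmul(h_local, w2_shard))    # full-shaped partial output
    # The element-wise sum over ranks plays the role of the all-reduce.
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*partials)]

x = [[1, 2], [3, 4]]                                   # 2 tokens, width 2
w1 = [[1, 0, 2, -1], [0, 1, -2, 3]]                    # 2 x 4
w2 = [[1, 2, 0], [0, 1, 1], [2, 0, 1], [1, 1, 1]]      # 4 x 3

reference = matmul(matmul(x, w1), w2)
sharded = tp_two_matmuls(x, w1, w2, ranks=2)
print(sharded == reference)
```

Because an element-wise nonlinearity applied to `h_local` acts independently on each column, the same split still works when an activation sits between the two multiplications.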