Deep Kernel Fusion for Transformers
Zixi Zhang, Zhiwen Mo, Yiren Zhao, Robert Mullins
TL;DR
This work tackles memory-bandwidth bottlenecks in agentic LLM inference, where long contexts cause SwiGLU MLP blocks to dominate memory traffic. It proposes DeepFusionKernel, a deeply fused CUDA operator that combines the SwiGLU computations into a single kernel, integrated with SGLang and paired with a profiler-driven scheduler that adapts to workload and hardware. Empirical results show consistent throughput gains over SGLang, with speedups of up to 9.7% on A100 and 13.2% on H100 across various batch sizes and long-generation scenarios. By minimizing memory traffic, the approach offers a deployable way to make better use of GPU compute in bandwidth-bound decoding, and it remains robust across models, configurations, and hardware platforms.
Abstract
Agentic LLM inference with long contexts is increasingly limited by memory bandwidth rather than compute. In this setting, SwiGLU MLP blocks, whose large weights exceed cache capacity, become a major yet under-optimized bottleneck. We propose DeepFusionKernel, a deeply fused kernel that cuts HBM traffic and boosts cache reuse, delivering up to 13.2% speedup on H100 and 9.7% on A100 over SGLang. Integrated with SGLang and paired with a kernel scheduler, DeepFusionKernel delivers consistent speedups across generation lengths while remaining adaptable to diverse models, inference configurations, and hardware platforms.
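To make the fusion idea concrete, the following is a minimal pure-Python sketch of a SwiGLU MLP for a single decode token, computed two ways. It is not the authors' CUDA kernel: the function names (`swiglu_unfused`, `swiglu_fused`), the weight layout, and the `tile` parameter are all assumptions made for illustration. The unfused version materializes the full gate, up, and hidden vectors, which on a GPU would correspond to intermediate round trips through HBM. The fused version processes the intermediate dimension in tiles and accumulates directly into the output, so only a tile-sized buffer is ever live.

```python
import math


def silu(v):
    # SiLU (swish) activation: v * sigmoid(v)
    return v / (1.0 + math.exp(-v))


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def swiglu_unfused(x, w_gate, w_up, w_down):
    # Assumed layout: w_gate, w_up are [d_ff][d_model]; w_down is [d_model][d_ff].
    gate = [dot(row, x) for row in w_gate]            # full intermediate
    up = [dot(row, x) for row in w_up]                # full intermediate
    hidden = [silu(g) * u for g, u in zip(gate, up)]  # full intermediate
    return [dot(row, hidden) for row in w_down]


def swiglu_fused(x, w_gate, w_up, w_down, tile=2):
    # Same math, but the d_ff dimension is walked in tiles and each tile is
    # consumed by the down projection immediately, so no full-size
    # gate/up/hidden vector is ever stored.
    d_ff = len(w_gate)
    out = [0.0] * len(w_down)
    for start in range(0, d_ff, tile):
        stop = min(start + tile, d_ff)
        h_tile = [
            silu(dot(w_gate[j], x)) * dot(w_up[j], x)
            for j in range(start, stop)
        ]
        for i, row in enumerate(w_down):
            out[i] += dot(row[start:stop], h_tile)
    return out
```

Both functions return the same result up to floating-point rounding, since the tiled version only changes the order of the summation. In a real GPU kernel, the tile buffer would live in registers or shared memory, and choosing the tile size per workload and device is the kind of decision a profiler-driven scheduler could make.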
