AI and Memory Wall
Amir Gholami, Zhewei Yao, Sehoon Kim, Coleman Hooper, Michael W. Mahoney, Kurt Keutzer
TL;DR
The paper examines memory bandwidth bottlenecks in large Transformer models, with a particular focus on serving rather than only training. Analyzing both encoder and decoder Transformer workloads, it shows that memory bandwidth can become the dominant bottleneck for decoder models. It argues for redesigning model architecture, training strategies, and deployment methods to mitigate this memory limitation. The work offers a roadmap for memory-aware AI systems that can keep pace with growing model sizes while still meeting latency requirements.
Abstract
The availability of unprecedented unsupervised training data, along with neural scaling laws, has resulted in a rapid surge in model size and compute requirements for serving and training LLMs. However, the main performance bottleneck is increasingly shifting to memory bandwidth. Over the past 20 years, peak server hardware FLOPS has scaled at 3.0x every 2 years, outpacing the growth of DRAM and interconnect bandwidth, which have scaled at only 1.6x and 1.4x every 2 years, respectively. This disparity has made memory, rather than compute, the primary bottleneck in AI applications, particularly in serving. Here, we analyze encoder and decoder Transformer models and show how memory bandwidth can become the dominant bottleneck for decoder models. We argue for a redesign in model architecture, training, and deployment strategies to overcome this memory limitation.
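To make the scaling disparity concrete, the per-2-year rates quoted in the abstract can be compounded over the full 20-year window. The short Python sketch below does only that arithmetic; the function and variable names are illustrative and not from the paper, and it assumes the rates held constant over the whole period.

```python
def cumulative_growth(rate_per_2yrs: float, years: float) -> float:
    """Total growth factor after `years`, given a multiplier per 2-year period."""
    return rate_per_2yrs ** (years / 2.0)


YEARS = 20  # window quoted in the abstract

# Per-2-year scaling rates quoted in the abstract.
rates = {
    "peak FLOPS": 3.0,
    "DRAM bandwidth": 1.6,
    "interconnect bandwidth": 1.4,
}

# Compound each rate over ten 2-year periods.
growth = {name: cumulative_growth(rate, YEARS) for name, rate in rates.items()}

for name, factor in growth.items():
    print(f"{name}: ~{factor:,.0f}x over {YEARS} years")

# How much faster compute grew than DRAM bandwidth over the same window.
gap = growth["peak FLOPS"] / growth["DRAM bandwidth"]
print(f"FLOPS-to-DRAM-bandwidth gap: ~{gap:,.0f}x")
```

Under these assumptions, compute grows by roughly four to five orders of magnitude while DRAM and interconnect bandwidth grow by about two and between one and two, respectively, which is the widening gap the abstract refers to.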
