Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models

Huwan Peng, Scott Davidson, Richard Shi, Shuaiwen Leon Song, Michael Taylor

TL;DR

This work tackles the escalating cost of serving large generative language models by introducing Chiplet Cloud, a chiplet-based ASIC LLM supercomputer with a dedicated on-chip memory system (CC-MEM) and sparsity support. It uses a two-phase hardware-software co-design methodology to search the design space and optimize the software mapping across eight LLMs, achieving up to $97\times$ and $18\times$ TCO/Token improvements over rented GPU and TPU clouds, respectively. The key contributions are CC-MEM with a Store-as-Compressed, Load-as-Dense scheme for sparsity, a scalable chiplet-based architecture, and a methodology that links hardware choices to software mapping for end-to-end TCO optimization. The results suggest that cutting the cost per generated token in cloud deployments could broaden access to modern LLMs.

Abstract

Large language models (LLMs) such as OpenAI's ChatGPT and Google's Gemini have demonstrated unprecedented capabilities of autoregressive AI models across multiple tasks, triggering disruptive technology innovations around the world. However, as models continue to grow, the cost to serve them also continues to grow, threatening the democratization of LLMs. To address this issue, we propose Chiplet Cloud, a chiplet-based ASIC LLM-supercomputer architecture whose goal is to optimize the total cost of ownership (TCO) per generated token. This architecture is a highly parameterizable ASIC and server-level architecture leveraging thousands of replicated accelerator modules collaborating to scale up the performance of LLMs at cloud scale. To determine specific parameterizations of the Chiplet Cloud architecture, we implemented a two-phase hardware-software co-design methodology that can search the massive design space and fine-tune the architecture across a collection of LLMs based on an accurate inference simulation. A common bottleneck for LLMs is memory access performance; therefore, we introduce CC-MEM, a scalable on-chip memory system for Chiplet Cloud architectures. Using the CC-MEM, Chiplet Clouds can be built using only SRAMs for design points where the power and performance of memory access are critical. The CC-MEM also includes a compression decoder module that adds support for sparse models without impacting the compute units, using a Store-as-Compressed, Load-as-Dense mechanism. We evaluate Chiplet Cloud architectures across eight popular LLMs. Using fine-tuned Chiplet Cloud servers, we are able to achieve $97\times$ and $18\times$ improvement in TCO/Token over rented GPU and TPU clouds, or an $8.3\times$ and $3.7\times$ improvement over fabricated GPU and TPU clouds, respectively. Chiplet Cloud can also support $1.7\times$ larger models with a sparsity of 60\%.
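
The Store-as-Compressed, Load-as-Dense idea can be illustrated with a minimal Python sketch. The bitmask encoding, the function names, and the 16-bit value width below are assumptions chosen for illustration; the abstract does not specify the compression format used by the CC-MEM decoder. The point is only that weights sit in memory in compressed form and are expanded to a dense vector on load, so the compute units never see the sparse format.

```python
from typing import List, Tuple


def compress(dense: List[float]) -> Tuple[List[int], List[float]]:
    """Store-as-Compressed: keep one mask bit per element plus the nonzero values.

    The bitmask format is an illustrative assumption, not the paper's format.
    """
    mask = [1 if v != 0 else 0 for v in dense]
    values = [v for v in dense if v != 0]
    return mask, values


def load_dense(mask: List[int], values: List[float]) -> List[float]:
    """Load-as-Dense: re-insert zeros so the consumer receives a dense vector."""
    remaining = iter(values)
    return [next(remaining) if bit else 0.0 for bit in mask]


def compressed_bits(mask: List[int], values: List[float], value_bits: int = 16) -> int:
    """Storage cost of the toy format: one mask bit per element plus the nonzeros."""
    return len(mask) + len(values) * value_bits


if __name__ == "__main__":
    row = [0.0, 1.5, 0.0, 0.0, -2.0]
    mask, values = compress(row)
    assert load_dense(mask, values) == row  # round trip is lossless
    print(compressed_bits(mask, values), "bits vs", len(row) * 16, "bits dense")
```

In this toy format the savings depend on both the sparsity and the per-element index overhead, which is why a given sparsity level does not translate one-to-one into extra model capacity.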

Paper Structure (28 sections, 6 equations, 15 figures, 2 tables)

Figures (15)

  • Figure 1: Compared to conventional systems, Chiplet Cloud (1) fits all model parameters inside the on-chip CC-MEM, greatly improving the performance; (2) co-optimizes the chip size with software mapping to reduce TCO/Perf; (3) exploits sparsity to reduce TCO and support larger models.
  • Figure 2: General architecture of an autoregressive generative large language model. In most LLMs, $d$ is significantly larger than $l_{ctx}$, causing FC layers to dominate the overall runtime. The inference is partitioned into two stages: prompt processing (prefill) and token generation (generate).
  • Figure 3: Chiplet Cloud architecture from the CC-MEM to the cloud.
  • Figure 4: Compression decoder unit.
  • Figure 5: Two-phase design methodology flow diagram. (a) The hardware exploration flow performs a bottom-up, LLM-agnostic design space exploration, generating thousands of realizable Chiplet Cloud server designs. (b) The software evaluation flow then takes the realizable server design points along with a generative LLM specification to perform software-optimized inference simulations and TCO estimations to find the optimal Chiplet Cloud design points.
  • ...and 10 more figures
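
The two-phase flow described in the Figure 5 caption can be sketched as a toy search in Python. Everything numeric here is a placeholder: the `Server` fields, the cost model, the throughput model, and the parameter grids are hypothetical and are not taken from the paper. The sketch only shows the shape of the method: an LLM-agnostic pass that keeps realizable server designs, followed by a per-model pass that tries software mappings and picks the design with the lowest TCO per token.

```python
from dataclasses import dataclass
from itertools import product
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Server:
    chips: int    # chips per server (hypothetical parameter)
    sram_mb: int  # on-chip memory per chip in MB (hypothetical parameter)
    cost: float   # toy lifetime cost, CapEx plus OpEx, in arbitrary units


def explore_hardware(max_cost: float) -> List[Server]:
    """Phase 1, LLM-agnostic: enumerate designs and keep the realizable ones."""
    servers = []
    for chips, sram_mb in product([4, 8, 16], [64, 128, 256]):
        cost = chips * (1.0 + sram_mb / 64)  # placeholder cost model
        if cost <= max_cost:                 # stand-in for real feasibility checks
            servers.append(Server(chips, sram_mb, cost))
    return servers


def evaluate_software(
    servers: List[Server], model_mb: int, batch_sizes: List[int]
) -> Optional[Tuple[float, Server, int, int]]:
    """Phase 2, per model: try mappings, return the lowest TCO/token found.

    Returns (tco_per_token, server, batch_size, n_servers), or None if
    no candidate was supplied.
    """
    best = None
    for server, batch in product(servers, batch_sizes):
        capacity_mb = server.chips * server.sram_mb
        n_servers = -(-model_mb // capacity_mb)  # ceiling: all weights stay on chip
        # Placeholder throughput model with diminishing returns in batch size.
        tokens = n_servers * server.chips * batch / (1.0 + 0.05 * batch)
        tco_per_token = n_servers * server.cost / tokens
        if best is None or tco_per_token < best[0]:
            best = (tco_per_token, server, batch, n_servers)
    return best


if __name__ == "__main__":
    candidates = explore_hardware(max_cost=40.0)
    print(evaluate_software(candidates, model_mb=4096, batch_sizes=[1, 8, 32]))
```

Splitting the search this way means the hardware pass runs once and its output can be reused for every model specification, while only the cheaper mapping pass is repeated per model.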