PIM-AI: A Novel Architecture for High-Efficiency LLM Inference

Cristobal Ortega, Yann Falevoz, Renaud Ayrignac

TL;DR

This work tackles the data-transfer bottlenecks and energy demands of large language model (LLM) inference by proposing PIM-AI, a novel processing-in-memory architecture integrated into DDR5/LPDDR5 memory. A PyTorch-based hardware simulator evaluates PIM-AI against state-of-the-art GPU and mobile SoC baselines across cloud and mobile scenarios, demonstrating up to 6.94x lower 3-year TCO per QPS in the cloud and 10x–20x reductions in energy per token on mobile, with 25–45% higher queries per second. The core contributions include the PIM-AI chip and DIMM designs, a scalable inter-chip data sharing mechanism, and a versatile simulator that maps model operations to hardware profiles. The findings suggest PIM-AI can significantly improve the efficiency, cost-effectiveness, and sustainability of wide-scale LLM deployments, motivating further prototype validation and exploration of heterogeneous integrations.

Abstract

Large Language Models (LLMs) have become essential in a variety of applications due to their advanced language understanding and generation capabilities. However, their computational and memory requirements pose significant challenges to traditional hardware architectures. Processing-in-Memory (PIM), which integrates computational units directly into memory chips, offers several advantages for LLM inference, including reduced data transfer bottlenecks and improved power efficiency. This paper introduces PIM-AI, a novel DDR5/LPDDR5 PIM architecture designed for LLM inference without modifying the memory controller or DDR/LPDDR memory PHY. We have developed a simulator to evaluate the performance of PIM-AI in various scenarios and demonstrate its significant advantages over conventional architectures. In cloud-based scenarios, PIM-AI reduces the 3-year TCO per queries-per-second by up to 6.94x compared to state-of-the-art GPUs, depending on the LLM used. In mobile scenarios, PIM-AI achieves a 10- to 20-fold reduction in energy per token compared to state-of-the-art mobile SoCs, resulting in 25% to 45% more queries per second and 6.9x to 13.4x less energy per query, extending battery life and enabling more inferences per charge. These results highlight PIM-AI's potential to revolutionize LLM deployments, making them more efficient, scalable, and sustainable.
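
The abstract reports per-token and per-query figures separately. As a minimal sketch of how such metrics could relate, the Python below combines an encoding phase and a decoding phase into energy per query and queries per second. The `PhaseProfile` fields, the single-query latency model, and every number are hypothetical illustrations, not values or formulas taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseProfile:
    """Hypothetical per-device measurements (illustrative names, not from the paper)."""
    encode_energy_j: float     # energy spent processing the prompt (encoding phase)
    encode_time_s: float       # time to first token
    energy_per_token_j: float  # decoding energy per generated token
    tokens_per_second: float   # decoding throughput


def energy_per_query(p: PhaseProfile, output_tokens: int) -> float:
    """Encoding energy plus decoding energy for all generated tokens."""
    return p.encode_energy_j + output_tokens * p.energy_per_token_j


def queries_per_second(p: PhaseProfile, output_tokens: int) -> float:
    """Assumes queries run one at a time, so QPS is the inverse of latency."""
    latency_s = p.encode_time_s + output_tokens / p.tokens_per_second
    return 1.0 / latency_s


# Made-up numbers chosen only to show the arithmetic.
baseline = PhaseProfile(encode_energy_j=20.0, encode_time_s=0.5,
                        energy_per_token_j=1.0, tokens_per_second=20.0)
pim = PhaseProfile(encode_energy_j=10.0, encode_time_s=1.0,
                   energy_per_token_j=0.1, tokens_per_second=40.0)

per_token_gain = baseline.energy_per_token_j / pim.energy_per_token_j
per_query_gain = energy_per_query(baseline, 100) / energy_per_query(pim, 100)
print(f"energy per token gain: {per_token_gain:.1f}x")
print(f"energy per query gain: {per_query_gain:.1f}x")
```

Under these made-up inputs the per-query gain comes out smaller than the per-token gain, because encoding energy is included in every query. This may be one reason the two reported ranges differ, though the paper's own accounting should be consulted.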

Paper Structure

This paper contains 32 sections, 5 figures, 1 table.

Figures (5)

  • Figure 1: Simplified architecture of transformer-based LLMs, showing the common structure used in both encoding and decoding phases.
  • Figure 2: PIM-AI chip architecture: The logic die houses 4 RISC-V processors, each with tensor and vector units, accessing DRAM banks through DRAM-logic connections. The memory PHY remains unchanged.
  • Figure 3: Execution flow of the LLM hardware simulator. The LLM hardware simulator overrides the functions and layers of the PyTorch library.
  • Figure 4: Comparative performance of one DGX-H100 server and four PIM-AI servers. (a) Time to first token and (b) energy consumption during the encoding phase; (c) tokens per second and (d) energy per token during the decoding phase; (e) queries per second and (f) energy per query as overall performance.
  • Figure 5: Comparative performance of PIM-AI in the mobile scenario. (a) Time to first token and (b) energy consumption during the encoding phase; (c) tokens per second and (d) energy per token during the decoding phase; (e) queries per second and (f) energy per query as overall performance. Each sub-figure shows PIM-AI gains over A17 Pro, Snapdragon 8 Gen3, and Dimensity 9300.
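
Figure 3 describes a simulator that intercepts PyTorch functions and layers and maps them to hardware profiles. The sketch below illustrates only the mapping step, in plain Python without PyTorch: an operation is reduced to a FLOP count and a byte count, then costed against a profile. The roofline-style time estimate, the `HardwareProfile` and `Op` types, and all numeric values are assumptions made for illustration; the paper's actual cost model may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareProfile:
    """Hypothetical device description (all fields are illustrative)."""
    name: str
    peak_flops: float       # FLOP/s
    mem_bandwidth: float    # bytes/s
    joules_per_flop: float
    joules_per_byte: float


@dataclass(frozen=True)
class Op:
    """What an intercepted operation would be reduced to in this sketch."""
    name: str
    flops: float
    bytes_moved: float


def matmul_op(m: int, k: int, n: int, bytes_per_elem: int = 2) -> Op:
    """Cost of an (m x k) @ (k x n) product, counting each operand and the result once."""
    flops = 2 * m * k * n
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return Op("matmul", flops, bytes_moved)


def estimate(op: Op, hw: HardwareProfile) -> tuple[float, float]:
    """Roofline-style estimate: time is set by the slower of compute and memory."""
    compute_time = op.flops / hw.peak_flops
    memory_time = op.bytes_moved / hw.mem_bandwidth
    time_s = max(compute_time, memory_time)
    energy_j = op.flops * hw.joules_per_flop + op.bytes_moved * hw.joules_per_byte
    return time_s, energy_j


# Made-up profile; a single-token decode step multiplies a 1 x 4096 vector
# by a 4096 x 4096 weight matrix.
example_hw = HardwareProfile("example-device", peak_flops=1e12, mem_bandwidth=1e11,
                             joules_per_flop=1e-12, joules_per_byte=1e-10)
decode_step = matmul_op(1, 4096, 4096)
time_s, energy_j = estimate(decode_step, example_hw)
print(f"{decode_step.name}: {time_s:.3e} s, {energy_j:.3e} J")
```

With this hypothetical profile the decode-step product is limited by memory traffic rather than arithmetic, which is the kind of data-transfer bottleneck the abstract says processing-in-memory is meant to reduce.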