HALO: Memory-Centric Heterogeneous Accelerator with 2.5D Integration for Low-Batch LLM Inference

Shubham Negi, Kaushik Roy

TL;DR

HALO is a memory-centric heterogeneous accelerator that integrates compute-in-DRAM (CiD) with an on-chip analog compute-in-memory (CiM) block using 2.5D packaging to accelerate low-batch, long-context LLM inference. A phase-aware mapping routes compute-bound prefill GEMMs to CiM and memory-bound decode GEMVs to CiD, while vector units handle non-GEMM operations, so that hardware stays well utilized in both phases. Evaluations on LLaMA-2 7B and Qwen3 8B show HALO attaining up to 18x end-to-end speedup over AttAcc and 2.5x over CENT, along with energy improvements. The paper also analyzes the trade-offs between fully CiD and fully CiM baselines. The results underscore the practical value of memory-centric heterogeneity and 2.5D integration for interactive LLM applications with long context lengths.

Abstract

The rapid adoption of Large Language Models (LLMs) has driven a growing demand for efficient inference, particularly in latency-sensitive applications such as chatbots and personalized assistants. Unlike traditional deep neural networks, LLM inference proceeds in two distinct phases: the prefill phase, which processes the full input sequence in parallel, and the decode phase, which generates tokens sequentially. These phases have very different compute and memory requirements, which makes accelerator design particularly challenging. Prior works have primarily been optimized for high-batch inference or evaluated only on short input context lengths, leaving the low-batch, long-context regime, which is critical for interactive applications, largely underexplored. We propose HALO, a heterogeneous memory-centric accelerator designed to address the distinct challenges of the prefill and decode phases in low-batch LLM inference. HALO integrates HBM-based Compute-in-DRAM (CiD) with an on-chip analog Compute-in-Memory (CiM) accelerator, co-packaged using 2.5D integration. To further improve hardware utilization, we introduce a phase-aware mapping strategy that adapts to the distinct demands of the prefill and decode phases. Compute-bound operations in the prefill phase are mapped to CiM to exploit its high-throughput matrix multiplication capability, while memory-bound operations in the decode phase are executed on CiD to benefit from reduced data movement within DRAM. Additionally, we analyze the performance trade-offs of LLMs under two architectural extremes, a fully CiD design and a fully on-chip analog CiM design, to highlight the need for a heterogeneous design. We evaluate HALO on the LLaMA-2 7B and Qwen3 8B models. Our experimental results show that LLMs mapped to HALO achieve up to 18x geometric mean speedup over AttAcc, an attention-optimized mapping, and 2.5x over CENT, a fully CiD-based mapping.
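
The phase-aware mapping described in the abstract can be illustrated with a minimal Python sketch. This is a simplification under stated assumptions, not the authors' implementation: the `Unit`, `Op`, and `map_op` names are hypothetical, and the rule is reduced to the three cases the text names (non-GEMM operations to vector units, prefill GEMMs to CiM, decode GEMVs to CiD). The real mapping may be finer-grained.

```python
from dataclasses import dataclass
from enum import Enum


class Unit(Enum):
    """Compute targets named in the text (enum itself is hypothetical)."""
    CIM = "on-chip analog CiM"
    CID = "HBM-based CiD"
    VECTOR = "vector unit"


@dataclass(frozen=True)
class Op:
    name: str
    kind: str   # "gemm", "gemv", or "non_gemm" (illustrative labels)
    phase: str  # "prefill" or "decode"


def map_op(op: Op) -> Unit:
    """Route one operation to a compute unit using the simplified rule."""
    if op.kind == "non_gemm":
        # LayerNorm, softmax, and similar ops go to the vector units.
        return Unit.VECTOR
    if op.phase == "prefill":
        # Prefill matrix multiplications are compute-bound: use CiM.
        return Unit.CIM
    # Decode matrix-vector products are memory-bound: use CiD.
    return Unit.CID


if __name__ == "__main__":
    ops = [
        Op("qkv_generation", "gemm", "prefill"),
        Op("layernorm", "non_gemm", "prefill"),
        Op("qkv_generation", "gemv", "decode"),
        Op("softmax", "non_gemm", "decode"),
    ]
    for op in ops:
        print(f"{op.phase:8s} {op.name:16s} -> {map_op(op).value}")
```

Keeping the rule as a pure function of operation kind and phase makes the policy easy to swap for a per-layer or per-shape decision if a finer mapping is needed.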

Paper Structure

This paper contains 13 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Roofline plot of the CiM accelerator (see the accelerator specification table) with general matrix-matrix multiplication (GEMM) operations ($L_{in}$=512) from the LLaMA-2 7B model mapped during prefill and decode phases for batch size (BS) 1 and 16, respectively. Prefill GEMMs generally achieve higher arithmetic intensity and approach the compute-bound region, while decode GEMMs, especially at batch size 1, are memory-bound and limited by bandwidth.
  • Figure 2: (a) Prefill phase of the LLM inference, where each decoder block consists of sub-operations such as LayerNorm, QKV generation, attention, projection and feedforward layers. (b) Decode phase of the LLM inference, which generates one token at a time and reuses cached Key-Value (KV) states.
  • Figure 3: (a) Overview of the proposed 2.5D integrated heterogeneous accelerator architecture (HALO). The system integrates compute units within the HBM3 stack to accelerate GEMV operations, and an analog compute-in-memory (CiM) accelerator co-packaged on the interposer to accelerate GEMM operations. Vector units are added to the logic die to perform non-GEMM operations. (b) Details of the GEMV units in the CiD architecture. (c) Analog CiM array based on 8T SRAM cells. (d) Details of the vector units in the logic die.
  • Figure 4: Execution time breakdown of different operations in LLaMA-2 7B for the prefill and decode phases with $L_{in}$=2048, $L_{out}$=128 and batch size=1.
  • Figure 5: (a) TTFT and (b) prefill phase energy for the LLaMA-2 7B model under varying input context lengths, when mapped to fully CiD and fully CiM accelerator architectures.
  • ...and 5 more figures
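
The roofline argument in the Figure 1 caption can be reproduced with a short Python sketch. It assumes a square 4096 x 4096 weight matrix (the LLaMA-2 7B hidden size), 2-byte elements, and that each operand is moved once. The `PEAK_FLOPS` and `MEM_BW` values are hypothetical placeholders, not the specifications from the paper, and the function names are illustrative.

```python
def arithmetic_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte for an (m x k) @ (k x n) multiplication.

    Assumes each operand and the output are moved exactly once.
    """
    flops = 2 * m * k * n
    traffic = bytes_per_elem * (m * k + k * n + m * n)
    return flops / traffic


def attainable_flops(ai: float, peak_flops: float, mem_bw: float) -> float:
    """Classic roofline: performance is capped by compute or by bandwidth."""
    return min(peak_flops, ai * mem_bw)


HIDDEN = 4096        # LLaMA-2 7B hidden dimension
L_IN = 512           # input context length used in Figure 1
PEAK_FLOPS = 100e12  # hypothetical placeholder, FLOP/s
MEM_BW = 1e12        # hypothetical placeholder, bytes/s

# Rows of the activation matrix: L_in tokens in prefill, one per sequence in decode.
cases = {
    "prefill, BS=1": L_IN,
    "decode,  BS=1": 1,
    "decode,  BS=16": 16,
}

for label, rows in cases.items():
    ai = arithmetic_intensity(rows, HIDDEN, HIDDEN)
    perf = attainable_flops(ai, PEAK_FLOPS, MEM_BW)
    bound = "compute-bound" if perf >= PEAK_FLOPS else "memory-bound"
    print(f"{label}: AI = {ai:8.2f} FLOP/byte, {bound}")
```

With these placeholder numbers the ridge point sits at 100 FLOP/byte, so the prefill GEMM lands in the compute-bound region while both decode cases stay memory-bound. Different hardware parameters would move the ridge point but not the ordering of the three intensities.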