Hemlet: A Heterogeneous Compute-in-Memory Chiplet Architecture for Vision Transformers with Group-Level Parallelism

Cong Wang, Zexin Fu, Jiayi Huang, Shanshi Huang

TL;DR

Vision Transformers impose heavy memory and compute demands, creating a memory-wall challenge for hardware deployment. Hemlet is a heterogeneous CIM chiplet system that integrates analog CIM (ACIM), digital CIM (DCIM), and Intermediate Data Process (IDP) chiplets over a network-on-package (NoP). It pairs them with a group-level parallelism (GLP) mapping strategy and system-level dataflow optimization to reduce inter-chiplet communication. The approach achieves speedups of 1.89x to 4.47x across hardware configurations and reaches a throughput of 9.24 TOPS with an energy efficiency of 4.98 TOPS/W on ViT workloads. By handling both static and dynamic vector-matrix multiplication (VMM) while maintaining accuracy, Hemlet offers a practical path toward high-performance ViT inference on scalable, modular hardware.

Abstract

Vision Transformers (ViTs) have established new performance benchmarks in vision tasks such as image recognition and object detection. However, these advancements come with significant demands for memory and computational resources, presenting challenges for hardware deployment. Heterogeneous compute-in-memory (CIM) accelerators have emerged as a promising solution for enabling energy-efficient deployment of ViTs. Despite this potential, monolithic CIM-based designs face scalability issues due to the size limitations of a single chip. To address this challenge, emerging chiplet-based techniques offer a more scalable alternative. However, chiplet designs come with their own costs, as they introduce expensive communication, which can hinder improvements in throughput. This work introduces Hemlet, a heterogeneous CIM chiplet system designed to accelerate ViT workloads. Hemlet enables flexible resource scaling through the integration of heterogeneous analog CIM (ACIM), digital CIM (DCIM), and Intermediate Data Process (IDP) chiplets. To improve throughput while reducing communication overhead, it employs a group-level parallelism (GLP) mapping strategy and system-level dataflow optimization, achieving speedups ranging from 1.89x to 4.47x across various hardware configurations within the chiplet system. Our evaluation results show that Hemlet can reach a throughput of 9.24 TOPS with an energy efficiency of 4.98 TOPS/W.
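
As a rough sanity check on the headline figures, dividing throughput by energy efficiency gives an implied power draw. The sketch below assumes both numbers describe the same operating point, which the abstract does not state; the variable names are illustrative only.

```python
# Headline figures quoted in the abstract.
throughput_tops = 9.24          # tera-operations per second
efficiency_tops_per_w = 4.98    # tera-operations per second per watt

# ASSUMPTION: both figures were measured at the same operating point.
# If so, TOPS / (TOPS/W) gives the implied power in watts.
implied_power_w = throughput_tops / efficiency_tops_per_w

print(f"Implied power: {implied_power_w:.2f} W")
```

Under that assumption the system would draw a little under 2 W. If the peak throughput and peak efficiency come from different configurations, this estimate does not apply.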

Paper Structure

This paper contains 15 sections, 3 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Architecture of (a) Analog CIM and (b) Digital CIM
  • Figure 2: Heterogeneous CIM for Transformer Computation
  • Figure 3: Comparison of monolithic and chiplet-based heterogeneous CIM architectures
  • Figure 4: (a) Overall system architecture integrating ACIM, DCIM, and IDP chiplets interconnected through a network-on-package (NoP); (b) Internal structure of an ACIM chiplet consisting of multiple ACIM PEs, a local buffer, and a NoP router and TX/RX; (c) Architecture of a processing engine (PE) with multiple subarrays (SAs), an adder tree, and input/output buffers; (d) Architecture of a DCIM chiplet comprising multiple DCIM PEs, together with a SIMD unit, a control FSM, a chiplet buffer, and a NoP communication module (router and TX/RX) to support inter-chiplet data movement; (e) Design of an Intermediate Data Process (IDP) chiplet equipped with SRAM banks, a SIMD unit, and NoP communication modules.
  • Figure 5: Motivation for group-level parallelism (GLP). (a) Typical CIM subarray with MUXs and shared ADCs; (b) A group of columns sharing one ADC and MUX is defined as a “Group”; (c) Time-multiplexed column access severely limits throughput, activating only one column per group per cycle; (d) System-wide ADC under-utilization under the layer-wise mapping method.
  • ...and 5 more figures
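
The time-multiplexing bottleneck described in the Figure 5 caption can be illustrated with a small calculation. This is a minimal sketch rather than the paper's model: the function names and the 128-column, 8-columns-per-ADC subarray are hypothetical values chosen for illustration.

```python
def columns_read_per_cycle(num_columns: int, columns_per_group: int) -> int:
    """Columns converted per cycle when each group shares one ADC via a MUX.

    Only one column per group can be selected in a given cycle, so the
    number of columns read equals the number of groups (that is, of ADCs).
    """
    if num_columns % columns_per_group != 0:
        raise ValueError("num_columns must be a multiple of columns_per_group")
    return num_columns // columns_per_group


def cycles_for_full_readout(columns_per_group: int) -> int:
    """Cycles needed to step the MUX through every column of a group."""
    return columns_per_group


# HYPOTHETICAL subarray: 128 columns, 8 columns sharing each ADC.
num_columns = 128
columns_per_group = 8

active = columns_read_per_cycle(num_columns, columns_per_group)
cycles = cycles_for_full_readout(columns_per_group)
active_fraction = active / num_columns

print(f"ADCs (groups): {active}")
print(f"Columns active per cycle: {active} of {num_columns} ({active_fraction:.1%})")
print(f"Cycles to read the whole subarray: {cycles}")
```

With these illustrative numbers, only 16 of 128 columns (12.5%) are active in any cycle, and a full readout takes 8 cycles. That ceiling on per-group throughput is the limitation panel (c) of Figure 5 points to.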