StreamDCIM: A Tile-based Streaming Digital CIM Accelerator with Mixed-stationary Cross-forwarding Dataflow for Multimodal Transformer

Shantian Qin, Ziqing Qiang, Zhihua Fan, Wenming Li, Xuejun An, Xiaochun Ye, Dongrui Fan

TL;DR

Multimodal Transformers impose high compute and memory demands that traditional digital CIM accelerators struggle to meet because of inflexible microarchitecture, dataflow, and pipeline design. StreamDCIM addresses these issues with a tile-based streaming CIM design that features a tile-based reconfigurable CIM macro (TBR-CIM), a mixed-stationary cross-forwarding dataflow for tile-level parallelism, and a ping-pong-like fine-grained compute-rewriting pipeline that overlaps CIM rewriting with computation. The key contributions are (1) the TBR-CIM macro with normal and hybrid modes, (2) the tile-based cross-forwarding dataflow enabling elastic single-macro scheduling, (3) the fine-grained compute-rewriting pipeline, and (4) a hardware demonstration showing speedups and energy savings on ViLBERT-based multimodal models. The results suggest that tile-based streaming CIM with dataflow and rewriting optimization can improve throughput and energy efficiency for on-chip multimodal Transformer inference, making CIM-based deployment more practical for multimodal AI tasks. On typical models, the reported geomean speedups are $2.63\times$ and $1.28\times$ over the non-streaming and layer-based streaming CIM baselines, respectively, along with $2.26\times$ energy savings.

Abstract

Multimodal Transformers are emerging artificial intelligence (AI) models designed to process a mixture of signals from diverse modalities. Digital computing-in-memory (CIM) architectures are considered promising for achieving high efficiency while maintaining high accuracy. However, current digital CIM-based accelerators lack the flexibility in microarchitecture, dataflow, and pipeline needed to accelerate multimodal Transformers effectively. In this paper, we propose StreamDCIM, a tile-based streaming digital CIM accelerator for multimodal Transformers. It overcomes the above challenges with three features. First, we present a tile-based reconfigurable CIM macro microarchitecture with normal and hybrid reconfigurable modes to improve intra-macro CIM utilization. Second, we implement a mixed-stationary cross-forwarding dataflow with tile-based execution decoupling to exploit tile-level computation parallelism. Third, we introduce a ping-pong-like fine-grained compute-rewriting pipeline to overlap high-latency on-chip CIM rewriting with computation. Experimental results show that StreamDCIM outperforms non-streaming and layer-based streaming CIM-based solutions by a geomean of $2.63\times$ and $1.28\times$, respectively, on typical multimodal Transformer models.
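
To make the idea of overlapping CIM rewriting with computation concrete, the following is a minimal latency sketch, not the paper's pipeline. It assumes two weight buffers used in ping-pong fashion, a single rewrite port, and hypothetical per-tile cycle counts; the function names `serial_latency` and `ping_pong_latency` are illustrative only.

```python
def serial_latency(compute, rewrite):
    """Total cycles when each tile is rewritten and then computed, with no overlap."""
    return sum(r + c for r, c in zip(rewrite, compute))


def ping_pong_latency(compute, rewrite):
    """Total cycles with two buffers: one computes while the other is rewritten.

    Assumptions (hypothetical, for illustration):
      * tile i uses buffer i % 2, so its rewrite must wait for compute i-2;
      * there is one rewrite port, so rewrites happen one after another;
      * there is one compute unit, so computes happen one after another.
    """
    rw_done = []  # cycle at which the rewrite for tile i finishes
    cp_done = []  # cycle at which the compute for tile i finishes
    for i, (r, c) in enumerate(zip(rewrite, compute)):
        port_free = rw_done[i - 1] if i >= 1 else 0
        buffer_free = cp_done[i - 2] if i >= 2 else 0
        rw_done.append(max(port_free, buffer_free) + r)
        prev_compute = cp_done[i - 1] if i >= 1 else 0
        cp_done.append(max(rw_done[i], prev_compute) + c)
    return cp_done[-1] if cp_done else 0


if __name__ == "__main__":
    compute = [4, 4, 4, 4]   # hypothetical compute cycles per tile
    rewrite = [3, 3, 3, 3]   # hypothetical rewrite cycles per tile
    print(serial_latency(compute, rewrite))     # 28
    print(ping_pong_latency(compute, rewrite))  # 19
```

In this toy setting only the first rewrite is exposed; every later rewrite is hidden behind the previous tile's computation. When rewriting takes longer than computing, the pipeline becomes rewrite-bound and the benefit shrinks accordingly.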

Paper Structure

This paper contains 10 sections and 7 figures.

Figures (7)

  • Figure 1: Overview of Different NN Accelerator Architectures.
  • Figure 2: Challenges for CIM-based Multimodal Transformer Acceleration: 1) Microarchitecture Inflexibility. 2) Dataflow Inflexibility. 3) Pipeline Inflexibility.
  • Figure 3: StreamDCIM: (a) Overall Architecture. (b) TBR-CIM Macro Microarchitecture.
  • Figure 4: (a) Mixed-stationary Cross-forwarding Dataflow & (b) Fine-grained Compute-Rewriting Pipeline (Example of the Stream for Modal X).
  • Figure 5: StreamDCIM: (a) Area Breakdown. (b) Power Breakdown.
  • ...and 2 more figures