All-rounder: A Flexible AI Accelerator with Diverse Data Format Support and Morphable Structure for Multi-DNN Processing

Seock-Hwan Noh, Seungpyo Lee, Banseok Shin, Sehun Park, Yongjoo Jang, Jaeha Kung

TL;DR

The paper tackles the challenge of building a datacenter AI accelerator that can efficiently support diverse data formats and mixed-precision operations for both inference and training. It introduces All-rounder, featuring an area-efficient all-in-one multiplier and a morphable MAC array that can fuse or split to maximize MAC utilization across multi-tenant workloads, along with a customized RISC-V ISA for control. Extensive hardware-level evaluation shows significant area and energy savings relative to baselines and strong MAC utilization across CNNs and LLMs, with competitive performance versus baseline accelerators and substantially better energy efficiency than a high-end GPU. The work demonstrates a practical path to scalable, flexible, and power-efficient cloud AI services by enabling diverse formats and operation shapes without proliferating dedicated hardware for each format.

Abstract

Recognizing the explosive increase in the use of AI-based applications, several industrial companies developed custom ASICs (e.g., Google TPU, IBM RaPiD, Intel NNP-I/NNP-T) and constructed hyperscale cloud infrastructure with them. These ASICs execute the inference or training operations of AI models requested by users. Since AI models differ in their data formats and operation types, the ASICs need to support diverse data formats and various operation shapes. However, previous ASIC solutions meet these requirements only partially, if at all. To overcome these limitations, we first present an area-efficient multiplier, named the all-in-one multiplier, that supports multiple bit-widths for both integer and floating point data types. Then, we build a MAC array equipped with these multipliers to provide multi-format support. In addition, the MAC array can be partitioned into multiple blocks that can be flexibly fused to support various DNN operation types. We evaluate the practical effectiveness of the proposed MAC array by building an accelerator around it, named All-rounder. According to our evaluation, the proposed all-in-one multiplier occupies 1.49x less area than the baselines with dedicated multipliers for each data format. We then compare the performance and energy efficiency of All-rounder with three different accelerators, showing consistent speedup and higher efficiency across various AI benchmarks, from vision to LLM-based language tasks.
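To make the idea of one multiplier serving several bit-widths more concrete, the sketch below shows a generic, textbook decomposition: an unsigned 8-bit product assembled from four 4-bit partial products, and the same 4-bit blocks used independently for two narrower products. This is not the paper's all-in-one multiplier design, which the abstract does not detail; the function names (`split_nibbles`, `mul4`, `mul8_from_mul4`, `dual_mul4`) are hypothetical, and signed integers and floating point formats are deliberately left out.

```python
from __future__ import annotations


def split_nibbles(x: int) -> tuple[int, int]:
    """Split an unsigned 8-bit value into (high, low) 4-bit halves."""
    if not 0 <= x <= 0xFF:
        raise ValueError("expected an unsigned 8-bit value")
    return x >> 4, x & 0xF


def mul4(a: int, b: int) -> int:
    """Stand-in for one unsigned 4-bit x 4-bit multiplier block (hypothetical)."""
    if not (0 <= a <= 0xF and 0 <= b <= 0xF):
        raise ValueError("expected unsigned 4-bit operands")
    return a * b


def mul8_from_mul4(a: int, b: int) -> int:
    """Fused mode: one unsigned 8-bit product from four 4-bit partial products.

    Uses (16*a_hi + a_lo) * (16*b_hi + b_lo)
       = 256*a_hi*b_hi + 16*(a_hi*b_lo + a_lo*b_hi) + a_lo*b_lo.
    """
    a_hi, a_lo = split_nibbles(a)
    b_hi, b_lo = split_nibbles(b)
    return (
        (mul4(a_hi, b_hi) << 8)
        + ((mul4(a_hi, b_lo) + mul4(a_lo, b_hi)) << 4)
        + mul4(a_lo, b_lo)
    )


def dual_mul4(a: int, b: int) -> tuple[int, int]:
    """Split mode: treat each byte as two packed 4-bit operands.

    Only the hi*hi and lo*lo blocks are used, giving two independent products.
    """
    a_hi, a_lo = split_nibbles(a)
    b_hi, b_lo = split_nibbles(b)
    return mul4(a_hi, b_hi), mul4(a_lo, b_lo)


if __name__ == "__main__":
    print(mul8_from_mul4(200, 123))  # 24600, same as 200 * 123
    print(dual_mul4(0x3A, 0x25))     # (6, 50): 3*2 and 10*5
```

The point of the sketch is only that the narrow partial products are shared between the wide and narrow modes; how the actual hardware arranges and sums them is described in the paper itself.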

Paper Structure (18 sections, 1 equation, 15 figures, 4 tables)

Figures (15)

  • Figure 1: Synthesis results of multipliers for various data types and precisions, synthesized in 28nm CMOS technology using Synopsys DesignWare IPs. Panels (a), (b), and (c) show the area, power consumption, and maximum operating frequency of each multiplier, respectively, normalized to the INT32 counterpart.
  • Figure 2: Data mappings on a 4$\times$4 systolic array (SA) for convolution operations commonly used in AI vision tasks. (a) Convolutions where outputs of MAC operations are accumulated across the input channels, and (b) convolutions where results of MAC operations are not accumulated across the input channels.
  • Figure 3: Example of data mapping of multiple AI workloads, i.e., two GEMMs, from two natural language processing models, on the 4$\times$4 systolic array.
  • Figure 4: (a) Structure of a floating point multiplier. (b) Area breakdown of FP8 multiplier. (c) Area breakdown of bfloat16 multiplier.
  • Figure 5: Examples of how the restructured CSM works for some data format combinations: (a) INT8 $\times$ INT8, (b) INT4 $\times$ INT4, (c) bfloat16 $\times$ bfloat16, (d) FP8 $\times$ FP8.
  • ...and 10 more figures
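
The distinction drawn in the Figure 2 caption, between convolutions that accumulate MAC results across input channels and those that do not, can be illustrated with a small pure-Python sketch. Reading the second case as a depthwise convolution is an assumption made here, as are the function names and the nested-list tensor layouts; the sketch says nothing about how the data is mapped onto the systolic array.

```python
def conv2d_single(x, w):
    """Valid 2-D correlation of one input plane x with one kernel w (lists of lists)."""
    rows, cols = len(x), len(x[0])
    k_rows, k_cols = len(w), len(w[0])
    return [
        [
            sum(
                x[i + r][j + s] * w[r][s]
                for r in range(k_rows)
                for s in range(k_cols)
            )
            for j in range(cols - k_cols + 1)
        ]
        for i in range(rows - k_rows + 1)
    ]


def standard_conv(x, w):
    """x: [C][H][W], w: [K][C][R][S]. Partial sums ARE accumulated across the C input channels."""
    out = []
    for kernel in w:
        planes = [conv2d_single(x[c], kernel[c]) for c in range(len(x))]
        out.append(
            [
                [sum(p[i][j] for p in planes) for j in range(len(planes[0][0]))]
                for i in range(len(planes[0]))
            ]
        )
    return out


def depthwise_conv(x, w):
    """x: [C][H][W], w: [C][R][S]. Each channel is filtered on its own; nothing is summed across channels."""
    return [conv2d_single(x[c], w[c]) for c in range(len(x))]


if __name__ == "__main__":
    x = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]      # two 2x2 input channels
    print(standard_conv(x, [[[[1]], [[1]]]]))     # one output channel, summed over both inputs
    print(depthwise_conv(x, [[[1]], [[2]]]))      # two output channels, one per input
```

In the first function every output element depends on all input channels, so a reduction across channels is required; in the second there is no such reduction, which is why the two cases call for different mappings.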