Flex-PE: Flexible and SIMD Multi-Precision Processing Element for AI Workloads

Mukul Lokhande, Gopal Raut, Santosh Kumar Vishvakarma

TL;DR

Flex-PE addresses the need for activation functions whose type and precision are configurable at runtime, combined with a SIMD multi-precision MAC, to support AI workloads ranging from edge inference to HPC. Built on a CORDIC-based processing element, it provides runtime-configurable activations (Sigmoid, Tanh, ReLU, Softmax) across FxP4/8/16/32 and supports iterative and pipelined operating modes that trade latency for area. In pipeline mode it achieves throughput gains of up to 16×, 8×, 4×, and 1× for FxP4/8/16/32 respectively, with 100% time-multiplexed hardware. It attains an energy efficiency of 8.42 GOPS/W with less than 2% accuracy loss, and reduces DMA reads by up to 62× for input feature maps and 371× for weight filters in VGG-16. Flex-PE targets both edge and cloud AI workloads, providing a flexible foundation for scalable, energy-efficient accelerators; future work includes extending the activation-function set and scaling to larger arrays.
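To make the FxP4/8/16/32 trade-off concrete, the following minimal Python sketch quantizes a real value to a signed fixed-point code of a given width. The function name `quantize_fxp` and the split between integer and fractional bits are assumptions chosen for illustration; this summary does not state the formats Flex-PE actually uses.

```python
def quantize_fxp(value, total_bits, frac_bits):
    """Quantize `value` to signed fixed-point with saturation.

    Hypothetical helper: `total_bits` is the word width (e.g. 4, 8, 16, 32)
    and `frac_bits` is an assumed number of fractional bits.
    Returns (integer code, real value represented by that code).
    """
    scale = 1 << frac_bits
    lo = -(1 << (total_bits - 1))          # most negative two's-complement code
    hi = (1 << (total_bits - 1)) - 1       # most positive two's-complement code
    code = max(lo, min(hi, round(value * scale)))  # round, then saturate
    return code, code / scale


if __name__ == "__main__":
    # Assumed (total_bits, frac_bits) pairs, for illustration only.
    for total_bits, frac_bits in [(4, 2), (8, 6), (16, 14), (32, 30)]:
        code, approx = quantize_fxp(0.3, total_bits, frac_bits)
        print(f"FxP{total_bits}: code={code}, value={approx}, error={abs(approx - 0.3):.2e}")
```

Narrower words leave fewer fractional bits, so the representable value drifts further from the input. That lost precision is what narrower lanes pay for higher SIMD throughput.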

Abstract

The rapid adoption of data-driven AI models, spanning deep learning inference and training, Vision Transformers (ViTs), and other HPC applications, drives a strong need for hardware support for non-linear activation functions (AFs) whose type and precision are configurable at runtime. Existing solutions support either diverse precisions or runtime AF reconfigurability, but not both simultaneously. This work proposes a flexible, SIMD multi-precision processing element (Flex-PE) that supports runtime-configurable AFs, including sigmoid, tanh, ReLU, and softmax, as well as the MAC operation. The proposed design achieves throughput improvements of up to 16× for FxP4, 8× for FxP8, 4× for FxP16, and 1× for FxP32 in pipeline mode with 100% time-multiplexed hardware. This work also proposes an area-efficient multi-precision iterative mode in SIMD systolic arrays for edge AI use cases. The design achieves up to 62× and 371× reductions in DMA reads for input feature maps and weight filters, respectively, in VGG-16, with an energy efficiency of 8.42 GOPS/W at an accuracy loss within 2%. The proposed architecture supports emerging 4-bit computations for DL inference while enhancing throughput in FxP8/16 modes for transformers and other HPC applications. The proposed approach enables future energy-efficient AI accelerators in edge and cloud environments.
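As a rough illustration of how a CORDIC datapath can produce tanh and sigmoid using only shifts and adds, here is a minimal Python sketch of textbook hyperbolic CORDIC in rotation mode. It is not the Flex-PE circuit. The function names, the iteration count, the 16 fractional bits, and the final floating-point division are all assumptions made for readability, and the input must stay within the basic CORDIC convergence range (roughly |theta| < 1.1).

```python
import math


def hyperbolic_iterations(n):
    """Shift indices 1, 2, 3, 4, 4, 5, ... (indices 4, 13, 40, ... are repeated
    because hyperbolic CORDIC needs those repeats to converge)."""
    seq, i, repeat = [], 1, 4
    while len(seq) < n:
        seq.append(i)
        if i == repeat and len(seq) < n:
            seq.append(i)
            repeat = 3 * repeat + 1
        i += 1
    return seq


def cordic_sinh_cosh(theta, n_iter=16, frac_bits=16):
    """Return (sinh, cosh) of theta using integer shift-add iterations.

    n_iter and frac_bits are illustrative defaults, not values from the paper.
    """
    scale = 1 << frac_bits
    shifts = hyperbolic_iterations(n_iter)
    # CORDIC gain for this shift sequence; pre-dividing x removes it from the result.
    gain = math.prod(math.sqrt(1.0 - 2.0 ** (-2 * i)) for i in shifts)
    x, y, z = round(scale / gain), 0, round(theta * scale)
    for i in shifts:
        d = 1 if z >= 0 else -1                      # rotate toward z = 0
        x, y = x + d * (y >> i), y + d * (x >> i)    # shift-add micro-rotation
        z -= d * round(math.atanh(2.0 ** -i) * scale)
    return y / scale, x / scale


def cordic_tanh(theta, **kw):
    s, c = cordic_sinh_cosh(theta, **kw)
    return s / c  # assumption: hardware would use a divider or a second CORDIC stage


def cordic_sigmoid(v, **kw):
    # Identity: sigmoid(v) = 0.5 * (1 + tanh(v / 2))
    return 0.5 * (1.0 + cordic_tanh(0.5 * v, **kw))


if __name__ == "__main__":
    print(cordic_tanh(0.5), math.tanh(0.5))
    print(cordic_sigmoid(1.0), 1.0 / (1.0 + math.exp(-1.0)))
```

Lowering `frac_bits` or `n_iter` in this sketch mimics a narrower fixed-point mode or fewer CORDIC stages, at the cost of a larger approximation error.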

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: (a) Typical Deep Neural Network (DNN) model showcasing various layers including Conv, Pooling, and FC. (b) AI SoC featuring a RISC-V-enabled Systolic Array with detailed PE architecture.
  • Figure 2: Workload analysis (TAI24-CORDIC-RNN) emphasizing the growing demand for performance-enhanced non-linear activation functions.
  • Figure 3: Pareto Evaluation for error metrics with proposed config-AF (softmax, sigmoid, tanh) with different LR and HV CORDIC stages for Flex-PE.
  • Figure 4: (a) Proposed SIMD FxP4/8/16/32 Configurable AF (Sigmoid, Tanh, ReLU, Softmax), (b) Detailed internal circuitry showcasing 5-stage SIMD Logarithmic barrel shifter and configurable Add_Sub circuit design.
  • Figure 5: Evaluation of DNN accuracy showcasing effects of precision scalability on CORDIC-based SIMD processing engine (Flex-PE).
  • ...and 1 more figures