FETTA: Flexible and Efficient Hardware Accelerator for Tensorized Neural Network Training

Jinming Lu, Jiayi Tian, Hai Li, Ian Young, Zheng Zhang

TL;DR

This work tackles the challenge of on-device training for tensorized neural networks by co-designing an algorithm (CSSE) and a hardware architecture (FETTA). CSSE expands the contraction-sequence search space and uses a two-stage cost model to identify hardware-friendly sequences, while FETTA employs a hierarchical, transposable contraction engine (CE) array and butterfly-based distribution/reduction networks to support flexible dataflows across the forward propagation (FP), backward propagation (BP), and weight gradient (WG) phases. The combination yields large improvements in latency and energy efficiency over GPUs, TPUs, and prior tensorized-training accelerators, supporting the value of algorithm-hardware co-optimization for edge-friendly DNN training. In practice, FETTA enables efficient, privacy-preserving on-device learning with substantial speedups (up to $20.5\times$ over GPU and $100.9\times$ over TPU) and energy savings (up to $567.5\times$ over GPU and $45.0\times$ over TPU).

Abstract

On-device training of deep neural networks (DNNs) is in increasing demand because it leverages personal data for high-performance applications while addressing privacy concerns and reducing communication latency. However, resource-constrained platforms face significant challenges due to the intensive computational and memory demands of DNN training. Tensor decomposition has emerged as a promising approach to compress model size without sacrificing accuracy. Nevertheless, training tensorized neural networks (TNNs) incurs non-trivial overhead and severe performance degradation on conventional accelerators due to complex tensor shaping requirements. To address these challenges, we propose FETTA, an algorithm and hardware co-optimization framework for efficient TNN training. On the algorithm side, we develop a contraction sequence search engine (CSSE) to identify the optimal contraction sequence with minimal computational overhead. On the hardware side, FETTA features a flexible and efficient architecture equipped with a reconfigurable contraction engine (CE) array to support diverse dataflows. Furthermore, butterfly-based distribution and reduction networks are implemented to perform flexible tensor shaping operations during computation. Evaluation results demonstrate that FETTA achieves reductions of 20.5x/100.9x, 567.5x/45.03x, and 11609.7x/4544.8x in processing latency, energy, and energy-delay product (EDP) over GPU and TPU, respectively. Moreover, on tensorized training workloads, FETTA outperforms prior accelerators with an average speedup of 3.87x to 14.63x and an average energy efficiency improvement of 1.41x to 2.73x.
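
To illustrate why the choice of contraction sequence matters, the following is a minimal sketch that exhaustively searches pairwise contraction orders and counts multiply-accumulate operations. It is not the paper's CSSE: the function names (`contract_cost`, `best_sequence`), the toy low-rank layer, and the dimension sizes are all assumptions made for illustration, and the cost here ignores any hardware-aware terms.

```python
from itertools import combinations
from math import prod


def contract_cost(a, b, sizes):
    """Multiply-accumulate count for contracting index sets a and b."""
    return prod(sizes[i] for i in a | b)


def best_sequence(tensors, output, sizes):
    """Brute-force search over pairwise contraction orders.

    tensors: list of index-label tuples, one per tensor (hypothetical format)
    output:  index labels that must survive in the final result
    sizes:   dict mapping each index label to its dimension
    Returns (total_cost, list_of_contracted_pairs).
    """
    tensors = [frozenset(t) for t in tensors]
    output = frozenset(output)

    def search(nodes):
        if len(nodes) == 1:
            return 0, []
        best = None
        for i, j in combinations(range(len(nodes)), 2):
            rest = [n for k, n in enumerate(nodes) if k not in (i, j)]
            # Keep an index only if a remaining tensor or the output needs it.
            needed = output.union(*rest)
            merged = (nodes[i] | nodes[j]) & needed
            cost = contract_cost(nodes[i], nodes[j], sizes)
            sub_cost, sub_seq = search(rest + [merged])
            total = cost + sub_cost
            if best is None or total < best[0]:
                step = (sorted(nodes[i]), sorted(nodes[j]))
                best = (total, [step] + sub_seq)
        return best

    return search(tensors)


# Toy example (assumed sizes): Y[b, m] = X[b, n] * A[n, r] * B[r, m]
sizes = {"b": 32, "n": 1024, "r": 8, "m": 1024}
cost, sequence = best_sequence(
    [("b", "n"), ("n", "r"), ("r", "m")], ("b", "m"), sizes
)
print(cost)  # 524288
```

In this toy case, contracting the input with the small factor first costs 524,288 multiply-accumulates, whereas reconstructing the full weight matrix first costs 41,943,040. Exhaustive search grows quickly with the number of tensor nodes, so it is practical only for small networks like this one.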

Paper Structure

This paper contains 39 sections, 6 equations, 16 figures, 3 tables, 1 algorithm.

Figures (16)

  • Figure 1: Tensor network diagrams illustrating (a) a 3rd-order tensor node, (b) a tensor contraction of the matrix multiplication, (c) a multi-node tensor network.
  • Figure 2: Tensor network diagrams for (a) Tensor Train, (b) Tensor Train Matrix, (c) Tensor Ring, (d) Block Term, (e) Hierarchical Tucker. In each graph, $\boldsymbol{\mathcal{X}} \in \mathbb{R}^{B \times N_1 \times N_2 \times N_3 \times N_4}$, $\boldsymbol{\mathcal{W}} \in \mathbb{R}^{M_1 \times M_2 \times M_3 \times M_4 \times N_1 \times N_2 \times N_3 \times N_4}$, and $\boldsymbol{\mathcal{Y}} \in \mathbb{R}^{B \times M_1 \times M_2 \times M_3 \times M_4}$.
  • Figure 3: Processing flow diagram for DNN training on a general accelerator.
  • Figure 4: Example computing schemes for a TT layer. Weight nodes are denoted with index for simplicity.
  • Figure 5: Training Performance Profile on GPU.
  • ...and 11 more figures