FETTA: Flexible and Efficient Hardware Accelerator for Tensorized Neural Network Training

Jinming Lu; Jiayi Tian; Hai Li; Ian Young; Zheng Zhang

TL;DR

This work tackles the challenge of on-device training for tensorized neural networks by co-designing an algorithm (CSSE) and a hardware architecture (FETTA). CSSE expands the contraction-sequence search space and uses a two-stage cost model to identify hardware-friendly sequences, while FETTA employs a hierarchical, transposable contraction engine (CE) array and butterfly-based distribution/reduction networks to support flexible dataflows across the forward propagation (FP), backward propagation (BP), and weight gradient (WG) phases of training. The combination yields large improvements in latency and energy efficiency over GPUs, TPUs, and prior tensorized-training accelerators, validating the value of algorithm-hardware co-optimization for edge-friendly DNN training. Practically, FETTA enables efficient, privacy-preserving on-device learning with substantial speedups (up to $20.5\times$ over GPU and $100.9\times$ over TPU) and energy savings (up to $567.5\times$ over GPU and $45.0\times$ over TPU).

Abstract

The increasing demand for on-device training of deep neural networks (DNNs) is driven by the goal of leveraging personal data for high-performance applications while addressing privacy concerns and reducing communication latency. However, resource-constrained platforms face significant challenges due to the intensive computational and memory demands of DNN training. Tensor decomposition has emerged as a promising approach to compress model size without sacrificing accuracy. Nevertheless, training tensorized neural networks (TNNs) incurs non-trivial overhead and severe performance degradation on conventional accelerators due to complex tensor shaping requirements. To address these challenges, we propose FETTA, an algorithm and hardware co-optimization framework for efficient TNN training. On the algorithm side, we develop a contraction sequence search engine (CSSE) to identify the optimal contraction sequence with minimal computational overhead. On the hardware side, FETTA features a flexible and efficient architecture equipped with a reconfigurable contraction engine (CE) array to support diverse dataflows. Furthermore, butterfly-based distribution and reduction networks are implemented to perform flexible tensor shaping operations during computation. Evaluation results demonstrate that FETTA achieves reductions of 20.5x/100.9x, 567.5x/45.03x, and 11609.7x/4544.8x in processing latency, energy, and energy-delay product (EDP) over GPU and TPU, respectively. Moreover, on tensorized training workloads, FETTA outperforms prior accelerators with a speedup of 3.87x to 14.63x and an energy-efficiency improvement of 1.41x to 2.73x on average.
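
To make the idea of a contraction sequence search concrete, here is a minimal sketch in Python. It is not the CSSE described above: it assumes a plain multiply-accumulate (MAC) count as the only cost, searches pairwise orders exhaustively, and ignores any hardware-aware terms. The function name `best_sequence`, the index labels, and the toy tensor-train layer dimensions are all hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import prod


def best_sequence(tensors, output, dims):
    """Exhaustively search pairwise contraction orders (illustrative only).

    tensors: tuple of frozensets, each holding the index labels of one tensor.
    output:  frozenset of index labels that must survive in the final result.
    dims:    dict mapping each index label to its size.
    Returns (total_macs, sequence), where sequence lists the index sets of
    the two operands contracted at each step.
    """
    if len(tensors) == 1:
        return 0, []
    best = (float("inf"), [])
    for i, j in combinations(range(len(tensors)), 2):
        a, b = tensors[i], tensors[j]
        rest = tuple(t for k, t in enumerate(tensors) if k not in (i, j))
        # An index is kept if the final output or a remaining tensor needs it.
        needed = set(output).union(*rest)
        result = frozenset((a | b) & needed)
        # MAC count of one pairwise contraction: product of all index sizes involved.
        step_cost = prod(dims[x] for x in a | b)
        sub_cost, sub_seq = best_sequence(rest + (result,), output, dims)
        total = step_cost + sub_cost
        if total < best[0]:
            best = (total, [(sorted(a), sorted(b))] + sub_seq)
    return best


# Hypothetical two-core tensor-train linear layer:
# activation x[b, i1, i2], cores G1[i1, o1, r] and G2[r, i2, o2], output y[b, o1, o2].
dims = {"b": 32, "i1": 16, "i2": 16, "o1": 16, "o2": 16, "r": 4}
x = frozenset({"b", "i1", "i2"})
g1 = frozenset({"i1", "o1", "r"})
g2 = frozenset({"r", "i2", "o2"})
out = frozenset({"b", "o1", "o2"})

cost, seq = best_sequence((x, g1, g2), out, dims)
print(cost, seq)
```

Under these assumed sizes, contracting the activation with one core first costs 1,048,576 MACs in total, whereas merging the two cores first costs 2,359,296, so the order alone changes the work by more than 2x. Exhaustive search is only practical for a handful of tensors; a real search engine would need pruning and a cost model that reflects the target hardware.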
