A Tensor-Train Decomposition based Compression of LLMs on Group Vector Systolic Accelerator

Sixiao Huang, Tintin Wang, Ang Li, Ao Shen, Kai Li, Keyao Jiang, Mingqiang Huang, Hao Yu

TL;DR

The paper tackles the storage and compute bottlenecks of deploying large language models on resource-constrained hardware by applying Tensor-Train Decomposition to compress linear layers. It combines TT-based compression with a group vector systolic array (GVSA) FPGA accelerator and operator fusion to realize efficient TT inference for LLMs. Across ChatGLM3-6B and LLaMA2-7B, it achieves whole-network compression factors of $1.94\times$ and $1.60\times$, respectively, and attains substantial first-token-delay reductions ($1.45\times$ and $1.57\times$) along with up to $3.22\times$ speedups in MLP blocks. This work demonstrates practical, edge-friendly LLM deployment with competitive throughput and acceptable accuracy loss, highlighting TT-based compression as a viable path for next-generation hardware-software co-design.

Abstract

Large language models (LLMs) are both storage-intensive and computation-intensive, posing significant challenges when deployed on resource-constrained hardware. Because linear layers account for most of the resource consumption in LLMs, this paper develops a tensor-train decomposition (TTD) for LLMs, together with a hardware implementation on FPGA. TTD compression is applied to the linear layers of the ChatGLM3-6B and LLaMA2-7B models, yielding whole-network compression ratios (CRs) of 1.94$\times$ and 1.60$\times$, respectively. The compressed LLMs are then implemented on FPGA hardware within a highly efficient group vector systolic array (GVSA) architecture, which provides DSP-shared parallel vector PEs for TTD inference as well as optimized data communication in the accelerator. Experimental results show that the resulting TTD-based LLM accelerator on FPGA achieves 1.45$\times$ and 1.57$\times$ reductions in first token delay for ChatGLM3-6B and LLaMA2-7B, respectively.
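
To make the idea of compressing a linear layer with TTD concrete, here is a minimal NumPy sketch of the standard TT-SVD procedure applied to a tensorized weight matrix with $d=3$. It is an illustration under stated assumptions, not the paper's implementation: the 64-by-64 toy weight, the factorization into (4, 4, 4) modes, the rank cap of 8, and the helper names `tensorize`, `tt_svd`, `tt_reconstruct`, and `compression_ratio` are all hypothetical, and the paper's actual layer shapes, ranks, and tensorization order may differ.

```python
import numpy as np


def tensorize(weight, row_factors, col_factors):
    """Reshape an (M, N) matrix into a d-way tensor with modes m_k * n_k."""
    d = len(row_factors)
    t = weight.reshape(*row_factors, *col_factors)
    # Interleave row and column modes: (m1, n1, m2, n2, ...).
    order = [i for k in range(d) for i in (k, d + k)]
    t = t.transpose(order)
    return t.reshape([m * n for m, n in zip(row_factors, col_factors)])


def tt_svd(tensor, max_rank):
    """Standard TT-SVD: returns cores of shape (r_prev, n_k, r_next)."""
    dims = tensor.shape
    cores = []
    r_prev = 1
    mat = tensor.reshape(r_prev * dims[0], -1)
    for k in range(len(dims) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))  # truncate to the allowed TT rank
        cores.append(u[:, :r].reshape(r_prev, dims[k], r))
        # Carry the remaining factor forward and unfold the next mode.
        mat = (s[:r, None] * vt[:r]).reshape(r * dims[k + 1], -1)
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores


def tt_reconstruct(cores):
    """Contract the TT cores back into the full tensor."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=([-1], [0]))
    return out[0, ..., 0]  # drop the boundary ranks of size 1


def compression_ratio(tensor, cores):
    """Original parameter count divided by the total size of the cores."""
    return tensor.size / sum(c.size for c in cores)


# Toy example (all sizes are assumptions chosen for illustration).
rng = np.random.default_rng(0)
weight = rng.standard_normal((64, 64))
tensor = tensorize(weight, (4, 4, 4), (4, 4, 4))
cores = tt_svd(tensor, max_rank=8)
rel_err = np.linalg.norm(tt_reconstruct(cores) - tensor) / np.linalg.norm(tensor)
print([c.shape for c in cores], compression_ratio(tensor, cores), rel_err)
```

A random matrix has no low-rank structure, so the truncation error in this toy run is large; the sketch only shows the mechanics of how the rank cap trades accuracy for a smaller parameter count.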

Paper Structure

This paper contains 16 sections, 4 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: LLMs with TTD compressed linear layers mapping on FPGA implemented group vector systolic accelerator.
  • Figure 2: Architecture of ChatGLM3-6B and LLaMA2-7B.
  • Figure 3: Basic principles of tensorization and TTD (Example with $d=3$).
  • Figure 4: Operator execution order for inference of TTD compressed LLMs.
  • Figure 5: Architecture of GVSA when $T_n=T_{out}/2$.
  • ...and 4 more figures