TerEffic: Highly Efficient Ternary LLM Inference on FPGA

Chenyang Yin, Zhenyu Bai, Pranav Venkatram, Shivam Aggarwal, Zhaoying Li, Tulika Mitra

TL;DR

TerEffic presents an FPGA-based accelerator tailored for ternary LLM inference, addressing edge-device constraints by combining 1.6-bit weight compression, a dedicated TMat Core, and a memory-aware architecture with two viable configurations. The fully on-chip variant exploits SRAM bandwidth across multiple FPGAs to achieve high throughput and energy efficiency for smaller models (e.g., 370M), while the HBM-assisted variant scales to larger models (up to 2.7B) using off-chip memory and batch-parallelism. Experimental results show substantial gains over edge GPUs (e.g., 16,300 tokens/s at 455 tokens/s/W for 370M) and competitive performance against data-center GPUs for larger scales (e.g., 727 tokens/s at 16 tokens/s/W for 2.7B), highlighting TerEffic’s potential for edge deployments. The work demonstrates the practicality of fully on-chip or HBM-assisted ternary LLM inference on FPGA, offering a path toward energy-efficient, scalable, hardware-tailored LLM deployment.
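The 1.6-bit figure can be illustrated with a short sketch. One common way to reach 1.6 bits per weight is to pack five ternary values into one byte, since 3^5 = 243 fits within the 256 values a byte can hold (8 bits / 5 weights = 1.6 bits per weight). The code below is a minimal illustration under that assumption; the function names `pack_ternary` and `unpack_ternary` are hypothetical, and the exact encoding used by TerEffic is not specified in this summary and may differ.

```python
from typing import List

TRITS_PER_BYTE = 5  # 3**5 = 243 <= 256, so five ternary digits fit in one byte


def pack_ternary(weights: List[int]) -> bytes:
    """Pack ternary weights (-1, 0, +1) into bytes, five weights per byte.

    Illustrative base-3 packing only; not necessarily the paper's encoding.
    """
    if len(weights) % TRITS_PER_BYTE != 0:
        raise ValueError("number of weights must be a multiple of 5")
    out = bytearray()
    for start in range(0, len(weights), TRITS_PER_BYTE):
        value = 0
        # Walk the group backwards so the first weight becomes the
        # least-significant base-3 digit.
        for w in reversed(weights[start:start + TRITS_PER_BYTE]):
            if w not in (-1, 0, 1):
                raise ValueError(f"not a ternary weight: {w}")
            value = value * 3 + (w + 1)  # map -1/0/+1 to digits 0/1/2
        out.append(value)
    return bytes(out)


def unpack_ternary(packed: bytes) -> List[int]:
    """Recover the ternary weights from bytes produced by pack_ternary."""
    weights: List[int] = []
    for byte in packed:
        for _ in range(TRITS_PER_BYTE):
            byte, digit = divmod(byte, 3)  # peel off least-significant digit
            weights.append(digit - 1)      # map digits 0/1/2 back to -1/0/+1
    return weights


weights = [-1, 0, 1, 1, 0, 1, 1, 1, 1, 1]
packed = pack_ternary(weights)
bits_per_weight = 8 * len(packed) / len(weights)  # 16 bits / 10 weights = 1.6
```

Compared with a plain 2-bit encoding of each ternary value, this packing stores the same weights in 20% less space, at the cost of a decode step before the weights can be used.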

Abstract

Deploying Large Language Models (LLMs) efficiently on edge devices is often constrained by limited memory capacity and high power consumption. Low-bit quantization methods, particularly ternary quantization, have demonstrated significant potential in preserving model accuracy while substantially decreasing memory footprint and computational costs. However, existing general-purpose architectures and accelerators have not fully exploited the advantages of low-bit quantization due to insufficient specialized hardware support. We introduce TerEffic, an FPGA-based architecture tailored for ternary-quantized LLM inference. The proposed system offers flexibility through reconfigurable hardware to meet various system requirements. We evaluated two representative configurations: a fully on-chip design that stores all weights within on-chip memories and scales out across multiple FPGAs, and an HBM-assisted design capable of accommodating larger models on a single FPGA board. Experimental results demonstrate significant performance and energy efficiency improvements. For single-batch inference on a 370M-parameter model, our fully on-chip architecture achieves 16,300 tokens/second, a throughput 192 times higher than that of the NVIDIA Jetson Orin Nano, with a power efficiency of 455 tokens/second/W, a 19-fold improvement. The HBM-assisted architecture processes 727 tokens/second for a larger 2.7B-parameter model, which is 3 times the throughput of the NVIDIA A100, while consuming only 46 W, resulting in a power efficiency of 16 tokens/second/W, an 8-fold improvement over the A100.
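As a quick sanity check on the reported figures, power efficiency is throughput divided by power draw. The sketch below recomputes the HBM-assisted efficiency from the stated 727 tokens/second and 46 W. It also derives the power draw implied by the fully on-chip figures; that value is not stated in the abstract and is only an inference from rounded numbers. The helper names are hypothetical.

```python
def tokens_per_second_per_watt(tokens_per_second: float, power_watts: float) -> float:
    """Power efficiency: throughput divided by power draw."""
    return tokens_per_second / power_watts


def implied_power_watts(tokens_per_second: float, efficiency: float) -> float:
    """Power draw implied by a throughput and an efficiency figure."""
    return tokens_per_second / efficiency


# HBM-assisted design, 2.7B parameters: both inputs are stated in the abstract.
hbm_efficiency = tokens_per_second_per_watt(727, 46)

# Fully on-chip design, 370M parameters: the power draw is NOT stated in the
# abstract; this is only what the two reported (rounded) numbers imply.
onchip_power = implied_power_watts(16_300, 455)

print(f"HBM-assisted efficiency: {hbm_efficiency:.1f} tokens/s/W")  # 15.8
print(f"Implied on-chip power:   {onchip_power:.1f} W")             # 35.8
```

The recomputed 15.8 tokens/second/W agrees with the reported value of 16 after rounding.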

Paper Structure

This paper contains 24 sections, 1 equation, 12 figures, 6 tables.

Figures (12)

  • Figure 1: On-chip vs. Off-chip Memory for Inference: Throughput and Energy Trends with Increasing Model Parameters
  • Figure 2: Architecture Overview
  • Figure 3: 1.6-Bit Weight Compression
  • Figure 4: RMSNorm Module
  • Figure 5: Ternary Matrix Multiplication (TMat) Core
  • ...and 7 more figures