TeLLMe: An Energy-Efficient Ternary LLM Accelerator for Prefilling and Decoding on Edge FPGAs

Ye Qiao, Zhiheng Chen, Yifan Zhang, Yian Wang, Sitao Huang

TL;DR

TeLLMe is presented as the first ternary LLM accelerator for low-power FPGAs that fully supports both prefill and autoregressive decoding with 1.58-bit weights and 8-bit activations. It includes a tightly integrated normalization and quantization-dequantization unit optimized for ultra-low-bit inference.

Abstract

Deploying large language models (LLMs) on edge platforms is challenged by their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as little as 1.58 bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected latency of the prefill phase. We present TeLLMe, the first ternary LLM accelerator for low-power FPGAs (e.g., AMD KV260) that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. Our contributions include: (1) a table-lookup matrix engine for ternary matmul that merges grouped activations with online precomputation to minimize resource use; (2) a fused, bandwidth-efficient attention module featuring a reversed reordering scheme to accelerate prefill; and (3) a tightly integrated normalization and quantization-dequantization unit optimized for ultra-low-bit inference. Under a 7 W power budget, TeLLMe delivers up to 9 tokens/s throughput over 1,024-token contexts and prefill latencies of 0.55–1.15 s for 64–128-token prompts, marking a significant energy-efficiency advance and establishing a new edge FPGA benchmark for generative AI.
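To make contribution (1) concrete, the following is a minimal software sketch of the general table-lookup idea for ternary matmul. It is not the authors' hardware design. A ternary weight takes one of three values, {-1, 0, +1}, which carries log2(3) ≈ 1.58 bits and explains the "1.58-bit" label. The sketch assumes a group size of G = 4 (the value shown in Figure 2). For each group of G activations it precomputes all 3^G possible signed partial sums once, then reuses that table for every output row. The function names (`pattern_index`, `build_table`, `tl_matvec`) and the base-3 index encoding are illustrative assumptions, not taken from the paper.

```python
import itertools
import numpy as np

G = 4  # assumed group size, matching the G = 4 shown in Figure 2


def pattern_index(w_group):
    """Map G ternary weights in {-1, 0, +1} to a base-3 table index (hypothetical encoding)."""
    idx = 0
    for w in w_group:
        idx = idx * 3 + (int(w) + 1)
    return idx


def build_table(act_group):
    """Precompute the partial sum of one activation group for all 3**G weight patterns."""
    table = np.zeros(3 ** G, dtype=np.int32)
    for pattern in itertools.product((-1, 0, 1), repeat=G):
        table[pattern_index(pattern)] = sum(
            w * int(a) for w, a in zip(pattern, act_group)
        )
    return table


def tl_matvec(weights, acts):
    """Ternary matrix-vector product using table lookups instead of multiplications.

    weights: (n_out, k) array with entries in {-1, 0, +1}
    acts:    (k,) array of 8-bit-range integer activations; k must be a multiple of G
    """
    n_out, k = weights.shape
    assert k % G == 0, "k must be a multiple of the group size G"
    out = np.zeros(n_out, dtype=np.int32)
    for g in range(0, k, G):
        # The table is built once per activation group and shared by all rows.
        table = build_table(acts[g:g + G])
        for r in range(n_out):
            out[r] += table[pattern_index(weights[r, g:g + G])]
    return out
```

In this sketch the cost of building each table is amortized over all output rows, which is the usual motivation for lookup-based low-bit matmul. How TeLLMe sizes, shares, and schedules its tables on the FPGA is described in the paper itself and is not reproduced here.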

Paper Structure

This paper contains 21 sections, 3 equations, 9 figures, 5 tables, 1 algorithm.

Figures (9)

  • Figure 1: Breakdown of TeLLMe 1.58-bit Model Inference Process with Prefill and Generation.
  • Figure 2: Dataflow and architecture of TL-based ternary matmul ($G = 4$).
  • Figure 3: System architecture of TeLLMe.
  • Figure 4: Visualization of scheduling on the attention map (number of computation cores $p = 4$; beige marks the attention mask).
  • Figure 5: Naive attention scheduling ($p = 4$).
  • ...and 4 more figures