Platinum: Path-Adaptable LUT-Based Accelerator Tailored for Low-Bit Weight Matrix Multiplication

Haoxuan Shan, Cong Guo, Chiyue Wei, Feng Cheng, Junyao Zhang, Hai Li, Yiran Chen

TL;DR

Platinum addresses the efficiency challenge of deploying ultra-low-bit quantized LLMs on edge hardware. It introduces a path-adaptable LUT-based ASIC for mpGEMM, in which offline MST-based LUT construction enables two execution modes: general bit-serial and optimized ternary-weight. On BitNet b1.58-3B, the design achieves up to a 73.6x speedup and a 32.4x energy reduction over SpikingEyeriss within a 0.96 mm^2 chip area, demonstrating the practical viability of LUT-based ASICs for ultra-low-bit neural networks at the edge. These results highlight the potential of offline path generation and ternary-optimized LUTs to deliver scalable, energy-efficient inference for edge AI workloads.

Abstract

The rapid scaling of large language models demands more efficient hardware. Quantization offers a promising trade-off between efficiency and performance. Ultra-low-bit quantization creates abundant opportunities for result reuse, which lookup-table (LUT)-based acceleration can exploit. However, existing LUT-based methods incur computation and hardware overheads for LUT construction and rely solely on bit-serial computation, which is suboptimal for ternary-weight networks. We propose Platinum, a lightweight ASIC accelerator for integer-weight mixed-precision matrix multiplication (mpGEMM) using LUTs. Platinum reduces LUT construction overhead via offline-generated construction paths and supports both general bit-serial and optimized ternary-weight execution through adaptive path switching. On BitNet b1.58-3B, Platinum achieves up to 73.6x, 4.09x, and 2.15x speedups over SpikingEyeriss, Prosperity, and 16-thread T-MAC (CPU), respectively, along with energy reductions of 32.4x, 3.23x, and 20.9x, all within a 0.96 mm^2 chip area. This demonstrates the potential of LUT-based ASICs as efficient, scalable solutions for ultra-low-bit neural networks on edge platforms.
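
To make the result-reuse idea concrete, the following is a minimal sketch of LUT-based matrix-vector multiplication with binary weights. It is an illustration only, not the Platinum design: the function names `gemv_direct` and `gemv_lut`, the group size `g`, and the example values are all assumptions introduced here. Each group of `g` inputs gets one table of all possible partial sums, and every weight row reuses that table instead of recomputing the sums.

```python
from itertools import product


def gemv_direct(w, x):
    """Reference: y[i] = sum_j w[i][j] * x[j], with w[i][j] in {0, 1}."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def gemv_lut(w, x, g):
    """LUT-based variant (hypothetical helper, not from the paper).

    Split the k inputs into groups of g, precompute every possible
    partial sum of each group once, then do one lookup per row and group.
    """
    k = len(x)
    assert k % g == 0, "this sketch assumes k is a multiple of g"
    y = [0] * len(w)
    for start in range(0, k, g):
        xs = x[start:start + g]
        # One table per group: 2**g entries, shared by all rows of w.
        lut = {bits: sum(b * v for b, v in zip(bits, xs))
               for bits in product((0, 1), repeat=g)}
        for i, row in enumerate(w):
            y[i] += lut[tuple(row[start:start + g])]
    return y


# Example values chosen for illustration only.
w = [[0, 0, 1, 1], [0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 1, 1], [1, 1, 0, 0]]
x = [3, -1, 4, 2]
print(gemv_lut(w, x, g=2))  # [6, 1, 7, 8, 2]
```

The table here is built naively, so its construction cost grows with `2**g`; that construction overhead is what the abstract identifies as a weakness of existing LUT-based methods.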

Paper Structure

This paper contains 15 sections, 3 equations, 10 figures, 1 table, 2 algorithms.

Figures (10)

  • Figure 1: Comparison between the original binary matrix-vector multiplication (GEMV) and the LUT-based optimization, using a binary weight matrix $\mathbf{w}$ of shape $(m, k)$ and an input vector $\mathbf{x}$ of shape $(k, n)$, where $m=5, n=1, k=2$. LUT-based optimization reduces computation by a factor of $k$.
  • Figure 2: Platinum leverages a programmable construction path to support both bit-serial LUT-based and ternary LUT-based mpGEMM. Path generation is moved offline to reduce runtime overhead. Refer to \ref{sec:ternary_lut} for more details on the benefits of ternary LUTs for ternary weights.
  • Figure 3: Architecture of Platinum Processor
  • Figure 4: Four-stage construction pipeline with build path.
  • Figure 5: Reduction in the number of additions for ternary-weight mpGEMM across LUT sizes, assuming $M=1080$.
  • ...and 5 more figures
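
The construction-path idea referenced in Figures 2 and 4 can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's algorithm: it reads "MST" as minimum spanning tree, treats every ternary pattern of length `g` as a graph node, connects patterns that differ by one add or subtract of a single input, and uses a breadth-first spanning tree (which is a minimum spanning tree when all edges cost the same). The names `build_path` and `construct_lut` are hypothetical.

```python
from collections import deque
from itertools import product


def build_path(g):
    """Offline step: breadth-first spanning tree over ternary patterns.

    Returns steps (child, parent, index, sign), each meaning
    LUT[child] = LUT[parent] + sign * x[index].
    """
    root = (0,) * g
    seen = {root}
    queue = deque([root])
    path = []
    while queue:
        parent = queue.popleft()
        for i in range(g):
            for sign in (1, -1):
                value = parent[i] + sign
                if value < -1 or value > 1:
                    continue  # stay inside the ternary set {-1, 0, 1}
                child = parent[:i] + (value,) + parent[i + 1:]
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
                    path.append((child, parent, i, sign))
    return path


def construct_lut(path, x):
    """Runtime step: replay the path with one add or subtract per entry."""
    lut = {(0,) * len(x): 0}
    for child, parent, i, sign in path:
        lut[child] = lut[parent] + sign * x[i]
    return lut


# Example values chosen for illustration only.
x = [5, -2, 7]
path = build_path(len(x))
lut = construct_lut(path, x)

# Every entry equals the dot product of its ternary pattern with x.
assert all(lut[p] == sum(w * v for w, v in zip(p, x))
           for p in product((-1, 0, 1), repeat=len(x)))
print(len(path))  # 26 steps, i.e. 3**3 - 1 additions or subtractions
```

Because the path depends only on `g` and not on the input values, it can be generated once ahead of time and replayed for every input group, which is the sense in which path generation is taken off the runtime critical path in this sketch.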