LUT-DLA: Lookup Table as Efficient Extreme Low-Bit Deep Learning Accelerator

Guoyu Li; Shengyu Ye; Chunyun Chen; Yang Wang; Fan Yang; Ting Cao; Cheng Liu; Mohamed M. Sabry; Mao Yang

TL;DR

LUT-DLA addresses the limit that scalar quantization cannot go below 1 bit by converting neural network models into lookup-table (LUT) representations via vector quantization. The framework combines a parameterized hardware generator, the LUTBoost multistage model converter, and a co-design space exploration engine that tunes software and hardware configurations together, supported by a LUT-Stationary (LS) dataflow for memory-centric processing. Key contributions include the LS dataflow, a lightweight multistage training pipeline, alternative similarity metrics ($L_1$, Chebyshev) trained stably with a straight-through estimator (STE), and quantitative design-space models for computation, memory, and hardware cost. Empirically, LUT-DLA improves power efficiency by $1.4\times$ to $7.0\times$ and area efficiency by $1.5\times$ to $146.1\times$, with modest accuracy losses on CNNs and Transformer models, demonstrating practical potential for scalable LUT-based accelerators.

Abstract

The emergence of neural network capabilities invariably leads to a significant surge in computational demands due to expanding model sizes and increased computational complexity. To reduce model size and lower inference costs, recent research has focused on simplifying models and designing hardware accelerators using low-bit quantization. However, due to numerical representation limits, scalar quantization cannot reduce bit width below 1 bit, which caps its benefits. To break through this limitation, we introduce LUT-DLA, a Look-Up Table (LUT) Deep Learning Accelerator Framework that utilizes vector quantization to convert neural network models into LUTs, achieving extreme low-bit quantization. The LUT-DLA framework facilitates efficient and cost-effective hardware accelerator designs and supports the LUTBoost algorithm, which transforms various DNN models into LUT-based models via multistage training, drastically cutting both computational and hardware overhead. Additionally, through co-design space exploration, LUT-DLA assesses the impact of various model and hardware parameters to fine-tune hardware configurations for different application scenarios, optimizing performance and efficiency. Our comprehensive experiments show that LUT-DLA improves power efficiency by $1.4\times$ to $7.0\times$ and area efficiency by $1.5\times$ to $146.1\times$, while maintaining only a modest accuracy drop. For CNNs, accuracy decreases by $0.1\%$ to $3.1\%$ using the $L_2$ distance similarity, $0.1\%$ to $3.4\%$ with the $L_1$ distance similarity, and $0.1\%$ to $3.8\%$ when employing the Chebyshev distance similarity. For transformer-based models, the accuracy drop ranges from $1.4\%$ to $3.0\%$.
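To make the idea of "converting a model into LUTs" concrete, the following is a minimal NumPy sketch of vector-quantized approximate matrix multiplication in the generic product-quantization style. It is not the paper's implementation: the function names (`assign_centroids`, `build_luts`, `lut_matmul`), the array layouts, and the assumption that codebooks are already trained are all illustrative. Each activation row is split into subvectors of length `V`, each subvector is replaced by the index of its nearest centroid under the $L_2$, $L_1$, or Chebyshev distance, and the product with the weights is read from a precomputed table instead of being multiplied out.

```python
import numpy as np


def assign_centroids(x_sub, centroids, metric="l2"):
    """Index of the nearest centroid for each row of x_sub.

    x_sub:     (n, V) activation subvectors
    centroids: (C, V) codebook for this subspace (assumed already trained)
    """
    diff = x_sub[:, None, :] - centroids[None, :, :]  # (n, C, V)
    if metric == "l2":
        dist = np.sum(diff ** 2, axis=-1)      # squared Euclidean distance
    elif metric == "l1":
        dist = np.sum(np.abs(diff), axis=-1)   # Manhattan distance
    elif metric == "chebyshev":
        dist = np.max(np.abs(diff), axis=-1)   # largest per-element gap
    else:
        raise ValueError(f"unknown metric: {metric}")
    return np.argmin(dist, axis=1)


def build_luts(weight, codebooks):
    """Precompute centroid-times-weight products.

    weight:    (K, N) with K = S * V
    codebooks: (S, C, V), one codebook per subspace
    returns:   (S, C, N) lookup tables
    """
    S, C, V = codebooks.shape
    w_sub = weight.reshape(S, V, -1)  # rows s*V..(s+1)*V-1 form block s
    return np.einsum("scv,svn->scn", codebooks, w_sub)


def lut_matmul(x, codebooks, luts, metric="l2"):
    """Approximate x @ weight using only index search and table lookups."""
    S, C, V = codebooks.shape
    out = np.zeros((x.shape[0], luts.shape[-1]))
    for s in range(S):
        idx = assign_centroids(x[:, s * V:(s + 1) * V], codebooks[s], metric)
        out += luts[s][idx]  # gather one precomputed row per input, then add
    return out
```

In this sketch the result equals `x @ weight` exactly only when every subvector coincides with a centroid; otherwise the error depends on how well the codebooks cover the activations, which is what a training procedure such as LUTBoost is meant to control. The $L_1$ and Chebyshev branches avoid the multiplications that the squared $L_2$ distance needs, which is one plausible reading of why the paper calls them hardware-friendly.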

Paper Structure (27 sections, 13 equations, 14 figures, 9 tables, 2 algorithms)

Introduction
Background and Motivation
Approximate Computing in Neural Networks
Vector Quantization for Approx. Matrix Multiplication
Challenges
LUT-DLA Framework Overview
LUT-DLA Hardware Architecture and Dataflow Design
Architecture Overview
LUT-DLA Dataflow Exploration
LUTBoost: Efficient Model Converter
Efficient Multistage Model Transformation
Hardware-Friendly Feature Similarity Comparison
Co-Design Space Search Engine
Design Space Exploration
Model Accuracy Sensitivity
...and 12 more sections

Figures (14)

Figure 1: Comparison of Area and Power Efficiency: LUT-Based Approximate Computing vs. ALU (higher is better; 28 nm FD-SOI @ 300 MHz; $1k\times1k\times1k$ matrix multiplication; $V$ = vector length, $C$ = number of centroids, equivalent bit-width = $V/\log_2 C$)
Figure 2: VQ for Approximating Matrix Multiplication
Figure 3: LUT-DLA Framework
Figure 4: LUT-DLA Hardware Architecture
Figure 5: Centroid Computation Units (CCU) Architecture
...and 9 more figures
