BERT-LSH: Reducing Absolute Compute For Attention

Zezheng Li; Kingston Yip

TL;DR

BERT-LSH introduces a locality-sensitive hashing-based attention mechanism that preserves the distinct Q and K representations of BERT while reducing the computation required for attention. By using SimHash with multiple hash functions to identify colliding Q–K vector pairs, it achieves notable KFLOP reductions and fewer dot products compared to full self-attention, at the cost of some implementation overhead. Empirically, BERT-LSH achieves competitive, and in some cases superior, pretraining and fine-tuning results on MLM, SST-2, and SQuAD, indicating improved generalization in resource-constrained settings. The work highlights both the practical potential of LSH-based attention to democratize access to powerful transformers and the need for further optimization for real-world deployment.
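The collision idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`simhash_signature`, `lsh_attention`), the parameters (`n_tables`, `n_bits`, `seed`), and the zero-vector fallback for queries with no collisions are all assumptions made for illustration. It hashes queries and keys with random-hyperplane SimHash, scores only the pairs that collide in at least one hash table, and counts how many Q–K dot products were computed.

```python
import math
import random


def simhash_signature(vec, hyperplanes):
    """Return one sign bit per random hyperplane (SimHash)."""
    return tuple(
        1 if sum(v * h for v, h in zip(vec, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def lsh_attention(Q, K, V, n_tables=4, n_bits=2, seed=0):
    """Score only Q-K pairs that collide in at least one hash table.

    Hypothetical sketch; returns (outputs, number of Q-K dot products).
    """
    rng = random.Random(seed)
    dim = len(Q[0])
    # candidates[i] holds the key indices sharing a bucket with query i
    candidates = [set() for _ in Q]
    for _ in range(n_tables):
        planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(n_bits)]
        buckets = {}
        for j, k in enumerate(K):
            buckets.setdefault(simhash_signature(k, planes), []).append(j)
        for i, q in enumerate(Q):
            candidates[i].update(buckets.get(simhash_signature(q, planes), []))

    outputs, n_dots = [], 0
    scale = 1.0 / math.sqrt(dim)
    for i, q in enumerate(Q):
        idx = sorted(candidates[i])
        if not idx:
            # Assumption: a query with no colliding key yields a zero vector
            outputs.append([0.0] * len(V[0]))
            continue
        scores = [scale * sum(a * b for a, b in zip(q, K[j])) for j in idx]
        n_dots += len(idx)
        # Numerically stable softmax over the colliding keys only
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * V[j][d] for w, j in zip(weights, idx)) / total
            for d in range(len(V[0]))
        ])
    return outputs, n_dots
```

For `n` tokens, full self-attention computes `n * n` Q–K dot products, so comparing that figure with the returned count gives a rough sense of the savings. The count excludes the cost of the hashing projections themselves, which is one source of the overhead mentioned above.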

Abstract

This study introduces a novel BERT-LSH model that incorporates Locality Sensitive Hashing (LSH) to approximate the attention mechanism in the BERT architecture. We examine the computational efficiency and performance of this model compared to a standard baseline BERT model. Our findings reveal that BERT-LSH significantly reduces computational demand for the self-attention layer while unexpectedly outperforming the baseline model in pretraining and fine-tuning tasks. These results suggest that the LSH-based attention mechanism not only offers computational advantages but also may enhance the model's ability to generalize from its training data. For more information, visit our GitHub repository: https://github.com/leo4life2/algoml-final

Paper Structure (23 sections, 8 equations, 5 figures, 4 tables, 1 algorithm)

Introduction
Methodology
Locality Sensitive Hashing Implementation
Model Implementation
BERT Baseline Model
BERT-LSH Model
Measuring Computational Efficiency
Pretraining
Masked Language Modeling (MLM)
Dataset and Training Procedure
Training Duration and Evaluation
GLUE SST-2 Fine-tuning
Fine-tuning Procedure
Training and Evaluation Metrics
SQuAD Fine-tuning
...and 8 more sections

Figures (5)

Figure 1: KFLOPs of the attention computation for different LSH configurations
Figure 2: Pretraining Losses: BERT-LSH achieved a lower evaluation loss than the baseline BERT model during pretraining.
Figure 3: Training Radar Plot: A plot for the training parameters and metrics. Values further away from the center represent "better" performance.
Figure 4: Text Classification Fine-tuning Loss: BERT-LSH showed a higher training loss but a lower evaluation loss than the baseline BERT model during fine-tuning on the SST-2 dataset.
Figure 5: Question Answering FineTune Loss: SQuAD2.0 dataset
