BERT-LSH: Reducing Absolute Compute For Attention

Zezheng Li, Kingston Yip

TL;DR

BERT-LSH introduces a locality-sensitive hashing-based attention mechanism that preserves the distinct Q and K representations of BERT while reducing the computation required for attention. By using SimHash with multiple hash functions to identify colliding Q–K vector pairs, it achieves notable KFLOP reductions and fewer dot products compared to full self-attention, at the cost of some implementation overhead. Empirically, BERT-LSH achieves competitive, and in some cases superior, pretraining and fine-tuning results on MLM, SST-2, and SQuAD, indicating improved generalization in resource-constrained settings. The work highlights both the practical potential of LSH-based attention to democratize access to powerful transformers and the need for further optimization for real-world deployment.

Abstract

This study introduces a novel BERT-LSH model that incorporates Locality Sensitive Hashing (LSH) to approximate the attention mechanism in the BERT architecture. We examine the computational efficiency and performance of this model compared to a standard baseline BERT model. Our findings reveal that BERT-LSH significantly reduces computational demand for the self-attention layer while unexpectedly outperforming the baseline model in pretraining and fine-tuning tasks. These results suggest that the LSH-based attention mechanism not only offers computational advantages but also may enhance the model's ability to generalize from its training data. For more information, visit our GitHub repository: https://github.com/leo4life2/algoml-final
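
The collision-based selection described above can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that); it assumes that SimHash means taking the sign pattern of projections onto random Gaussian hyperplanes, and that a Q–K pair counts as colliding if it shares a bucket in at least one hash table. All function names and the parameters `num_tables` and `num_bits` are hypothetical, chosen for illustration. How non-colliding pairs are handled in the attention matrix is not stated here, so the sketch only returns scores for the colliding pairs.

```python
import random


def simhash_signature(vec, hyperplanes):
    """Return one sign bit per hyperplane (1 if the projection is >= 0)."""
    return tuple(
        1 if sum(v * h for v, h in zip(vec, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def make_tables(dim, num_tables, num_bits, rng):
    """Draw num_tables independent sets of num_bits Gaussian hyperplanes."""
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]
        for _ in range(num_tables)
    ]


def colliding_pairs(queries, keys, tables):
    """Return all (i, j) where query i and key j share a bucket in any table."""
    pairs = set()
    for hyperplanes in tables:
        # Bucket the keys by signature, then look up each query's bucket.
        buckets = {}
        for j, key in enumerate(keys):
            buckets.setdefault(simhash_signature(key, hyperplanes), []).append(j)
        for i, query in enumerate(queries):
            for j in buckets.get(simhash_signature(query, hyperplanes), []):
                pairs.add((i, j))
    return pairs


def sparse_scores(queries, keys, pairs):
    """Compute scaled dot products only for the colliding pairs."""
    scale = len(queries[0]) ** 0.5
    return {
        (i, j): sum(a * b for a, b in zip(queries[i], keys[j])) / scale
        for (i, j) in pairs
    }


# Toy usage: Q and K are separate matrices, as in BERT.
rng = random.Random(0)
dim, seq_len = 8, 6
Q = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
K = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(seq_len)]
tables = make_tables(dim, num_tables=2, num_bits=3, rng=rng)
pairs = colliding_pairs(Q, K, tables)
scores = sparse_scores(Q, K, pairs)
print(f"{len(pairs)} of {seq_len * seq_len} dot products computed")
```

In this sketch, more bits per table make buckets narrower (fewer collisions, fewer dot products), while more tables give each pair more chances to collide. The hashing itself costs extra projections per vector, which is one plausible source of the overhead the summary mentions.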

Paper Structure

This paper contains 23 sections, 8 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: KFLOPs of the attention computation for different LSH configurations
  • Figure 2: Pretraining Losses: BERT-LSH showed better evaluation loss than the baseline BERT model during pretraining.
  • Figure 3: Training Radar Plot: A plot for the training parameters and metrics. Values further away from the center represent "better" performance.
  • Figure 4: Text Classification Finetune Loss: BERT-LSH showed worse training loss but better evaluation loss than the baseline BERT model during fine-tuning on the SST-2 dataset.
  • Figure 5: Question Answering FineTune Loss: SQuAD2.0 dataset