BERT-LSH: Reducing Absolute Compute For Attention
Zezheng Li, Kingston Yip
TL;DR
BERT-LSH introduces a locality-sensitive hashing-based attention mechanism that preserves the distinct Q and K representations of BERT while reducing the computation required for attention. By using SimHash with multiple hash functions to identify colliding Q–K vector pairs, it achieves notable KFLOP reductions and fewer dot products compared to full self-attention, at the cost of some implementation overhead. Empirically, BERT-LSH achieves competitive, and in some cases superior, pretraining and fine-tuning results on MLM, SST-2, and SQuAD, indicating improved generalization in resource-constrained settings. The work highlights both the practical potential of LSH-based attention to democratize access to powerful transformers and the need for further optimization for real-world deployment.
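The collision-based selection described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (simhash_codes, lsh_attention), the parameters n_tables and n_bits, the "collide in any table" rule, and the zero output for queries with no colliding key are all assumptions made for illustration.

```python
import numpy as np

def simhash_codes(x, planes):
    """Hash each row of x (n, d) with every table of random hyperplanes.

    planes has shape (n_tables, n_bits, d); the result has shape
    (n_tables, n) and holds one integer bucket id per vector per table.
    """
    # Sign of the projection onto each hyperplane gives one bit.
    bits = np.einsum("tbd,nd->tnb", planes, x) >= 0
    # Pack the n_bits sign bits into a single integer bucket id.
    weights = 2 ** np.arange(planes.shape[1])
    return (bits * weights).sum(axis=-1)

def lsh_attention(Q, K, V, n_tables=4, n_bits=3, seed=0):
    """Attention restricted to Q-K pairs that share a SimHash bucket.

    Assumption: a pair counts as colliding if it matches in ANY table.
    Returns the attended values and the boolean collision mask.
    """
    rng = np.random.default_rng(seed)
    d = Q.shape[1]
    planes = rng.standard_normal((n_tables, n_bits, d))
    q_codes = simhash_codes(Q, planes)  # (n_tables, n_q)
    k_codes = simhash_codes(K, planes)  # (n_tables, n_k)
    collide = (q_codes[:, :, None] == k_codes[:, None, :]).any(axis=0)

    # Dot products are computed only for colliding pairs.
    scores = np.full(collide.shape, -np.inf)
    rows, cols = np.nonzero(collide)
    scores[rows, cols] = np.einsum("id,id->i", Q[rows], K[cols]) / np.sqrt(d)

    # Softmax over colliding keys; a query with no collisions gets zeros
    # (an assumption of this sketch, not a claim about the paper).
    row_max = scores.max(axis=1, keepdims=True)
    row_max = np.where(np.isfinite(row_max), row_max, 0.0)
    probs = np.exp(scores - row_max)
    denom = probs.sum(axis=1, keepdims=True)
    probs = np.divide(probs, denom, out=np.zeros_like(probs), where=denom > 0)
    return probs @ V, collide
```

In this sketch, collide.sum() is the number of dot products actually evaluated, to be compared with n_q * n_k for full self-attention. The dense comparison used to build the mask is itself quadratic in sequence length, so a practical version would presumably group vectors by bucket id instead; this is the kind of implementation overhead the summary alludes to.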
Abstract
This study introduces a novel BERT-LSH model that incorporates Locality Sensitive Hashing (LSH) to approximate the attention mechanism in the BERT architecture. We examine the computational efficiency and performance of this model compared to a standard baseline BERT model. Our findings reveal that BERT-LSH significantly reduces computational demand for the self-attention layer while unexpectedly outperforming the baseline model in pretraining and fine-tuning tasks. These results suggest that the LSH-based attention mechanism not only offers computational advantages but also may enhance the model's ability to generalize from its training data. For more information, visit our GitHub repository: https://github.com/leo4life2/algoml-final
