Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers
Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen
TL;DR
Hamming Attention Distillation (HAD) tackles the computational burden of long-context transformers by binarizing keys and queries to $\{-1,+1\}$ and replacing dot products with Hamming-distance computations, complemented by top-$N$ sparsification of the attention matrix. It trains a binarized student model by distillation from a full-precision teacher, using two KL losses to align both the attention distributions and the outputs, and it refines the binarization through a staged training schedule that moves from a tanh-like approximation toward a sign-like representation with straight-through gradients. HAD delivers state-of-the-art or highly competitive results among binarized-transformer approaches on GLUE and ImageNet, and demonstrates robust long-context performance on QuALITY. In custom hardware simulations, CAM-based in-memory attention yields a 79% area reduction and an 87% power reduction relative to a standard attention implementation. The work highlights practical implications for deploying long-context transformers on server-grade hardware, with potential extensions to GPUs, larger sparsity regimes, reduced precision for the value matrix, and decoder-only LLMs.
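To make the staged binarization concrete, the following is a minimal sketch, not the authors' implementation. It assumes the tanh-like surrogate has the form tanh(c·x) with a hypothetical sharpness parameter `c` that is increased during training, and it shows a discrete KL divergence of the kind that could be used to compare teacher and student attention rows. The names `soft_binarize`, `hard_binarize`, `sharpness`, and `kl_divergence` are illustrative. The straight-through gradient is not shown, since it requires an autograd framework.

```python
import math

def soft_binarize(x, sharpness):
    """tanh-like surrogate; approaches sign(x) as sharpness grows.
    The form tanh(sharpness * x) is an assumption made for illustration."""
    return [math.tanh(sharpness * v) for v in x]

def hard_binarize(x):
    """Sign-like binarization to {-1, +1}; zero maps to +1 by convention here."""
    return [1.0 if v >= 0 else -1.0 for v in x]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

# The gap between the soft surrogate and the hard sign shrinks as sharpness grows.
x = [0.8, -0.3, 0.1, -1.2]
for c in (1.0, 10.0, 100.0):
    gap = max(abs(s - h) for s, h in zip(soft_binarize(x, c), hard_binarize(x)))
    print(f"sharpness={c:6.1f}  max |tanh - sign| = {gap:.2e}")

# KL between a teacher attention row and a student attention row (toy scores).
teacher_row = softmax([2.0, 0.5, -1.0])
student_row = softmax([1.5, 0.7, -0.8])
print("KL(teacher || student) =", kl_divergence(teacher_row, student_row))
```

In this sketch the surrogate stays differentiable early on, and the hard sign is only a good stand-in once the sharpness is large, which is the intuition behind moving from a tanh-like to a sign-like representation in stages.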
Abstract
Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into $\{-1,+1\}$ vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences.

Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference.

We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD incurs a performance loss of just $\mathbf{1.78}\%$ on GLUE, compared to $9.08\%$ for state-of-the-art binarization work, and a loss of $\mathbf{2.5}\%$ on ImageNet, compared to $12.14\%$, all while targeting custom hardware with a $\mathbf{79}\%$ area reduction and $\mathbf{87}\%$ power reduction compared to its standard attention counterpart.
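The substitution of Hamming distance for dot products rests on a simple identity: for $q, k \in \{-1,+1\}^d$, $q \cdot k = d - 2\,d_H(q, k)$, where $d_H$ counts the positions in which the two vectors differ. The sketch below illustrates this with bit-packed vectors, XOR, and a population count, followed by top-$N$ selection per query. It is an illustration under stated assumptions rather than the paper's implementation: the function names, the $\sqrt{d}$ score scaling, and the softmax over only the retained entries are all choices made here for the example.

```python
import math

def pack_bits(vec):
    """Pack a {-1,+1} vector into an int: bit i is set where vec[i] == +1."""
    word = 0
    for i, v in enumerate(vec):
        if v > 0:
            word |= 1 << i
    return word

def hamming(a, b):
    """Number of differing bits between two packed vectors (XOR + popcount)."""
    return bin(a ^ b).count("1")

def hamming_topn_attention(queries, keys, values, top_n):
    """Attention with binarized queries/keys and top-N sparsification per query.
    Scores use the identity q.k = d - 2 * hamming(q, k)."""
    d = len(keys[0])
    scale = math.sqrt(d)  # assumed scaling, mirroring standard attention
    packed_keys = [pack_bits(k) for k in keys]
    outputs = []
    for q in queries:
        pq = pack_bits(q)
        scores = [d - 2 * hamming(pq, pk) for pk in packed_keys]
        # Keep only the indices of the top_n largest scores for this query.
        keep = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:top_n]
        m = max(scores[j] for j in keep)
        weights = [math.exp((scores[j] - m) / scale) for j in keep]
        total = sum(weights)
        # Weighted sum of the (full-precision) value rows that were kept.
        out = [0.0] * len(values[0])
        for w, j in zip(weights, keep):
            for c, v in enumerate(values[j]):
                out[c] += (w / total) * v
        outputs.append(out)
    return outputs

# Check the identity on one pair of binary vectors.
q = [1, -1, 1, 1]
k = [1, 1, -1, 1]
dot = sum(a * b for a, b in zip(q, k))
assert dot == len(q) - 2 * hamming(pack_bits(q), pack_bits(k))
```

Because the score depends only on the XOR and population count of packed words, the inner loop needs no multiplications, which is what makes the operation a natural fit for bitwise or in-memory hardware.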
