Hamming Attention Distillation: Binarizing Keys and Queries for Efficient Long-Context Transformers

Mark Horton, Tergel Molom-Ochir, Peter Liu, Bhavna Gopal, Chiyue Wei, Cong Guo, Brady Taylor, Deliang Fan, Shan X. Wang, Hai Li, Yiran Chen

TL;DR

Hamming Attention Distillation (HAD) tackles the computational burden of long-context transformers by binarizing keys and queries to $\{-1,+1\}$ and replacing dot products with Hamming-distance computations, complemented by top-$N$ sparsification of the attention matrix. It trains a binarized student model by distillation from a full-precision teacher, using two KL losses to align both attention and outputs, and it refines the binarization through a staged training schedule that moves from a tanh-like approximation toward a sign-like representation with straight-through gradients. HAD delivers state-of-the-art or highly competitive results among binarized-transformer approaches on GLUE and ImageNet, and demonstrates robust long-context performance on QuALITY, while achieving substantial hardware efficiency gains through CAM-based in-memory attention, with large area and power reductions. The work highlights practical implications for deploying long-context transformers on server-grade hardware, with potential extensions to GPUs, larger sparsity regimes, reduced precision for the value matrix, and decoder-only LLMs.
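The core substitution can be illustrated with a small NumPy sketch. It relies on the standard identity that, for $q, k \in \{-1,+1\}^d$, the dot product equals $d - 2 \cdot \mathrm{Hamming}(q, k)$. The function names, the sign convention at zero, the `scale` argument, and the way the top-$N$ mask is applied before the softmax are all assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def binarize(x):
    """Map real values to {-1, +1}; zeros go to +1 (an assumed convention)."""
    return np.where(x >= 0, 1, -1).astype(np.int8)

def hamming_scores(q_bin, k_bin):
    """Query/key dot products of {-1,+1} vectors, computed via Hamming distance.

    For q, k in {-1,+1}^d:  q . k = d - 2 * hamming(q, k).
    """
    d = q_bin.shape[-1]
    # Count mismatching positions for every (query, key) pair.
    mismatches = (q_bin[:, None, :] != k_bin[None, :, :]).sum(axis=-1)
    return d - 2 * mismatches

def top_n_attention(scores, values, n, scale):
    """Softmax over only the n largest scores in each row, then mix the values."""
    scores = scores.astype(np.float64) * scale
    # Indices of the n largest scores per row (ties are broken arbitrarily).
    keep = np.argpartition(scores, -n, axis=-1)[:, -n:]
    masked = np.full_like(scores, -np.inf)
    np.put_along_axis(masked, keep, np.take_along_axis(scores, keep, axis=-1), axis=-1)
    # Numerically stable softmax; masked-out entries become exactly zero.
    masked -= masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights
```

Calling `hamming_scores(binarize(Q), binarize(K))` on real-valued `Q` and `K` gives the same integers as the matrix product of the binarized matrices, so the mismatch count can stand in for the multiply-accumulate. This sketch covers inference only; the distillation losses and the staged tanh-to-sign schedule are not modeled here.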

Abstract

Pre-trained transformer models with extended context windows are notoriously expensive to run at scale, often limiting real-world deployment due to their high computational and memory requirements. In this paper, we introduce Hamming Attention Distillation (HAD), a novel framework that binarizes keys and queries in the attention mechanism to achieve significant efficiency gains. By converting keys and queries into $\{-1,+1\}$ vectors and replacing dot-product operations with efficient Hamming distance computations, our method drastically reduces computational overhead. Additionally, we incorporate attention matrix sparsification to prune low-impact activations, which further reduces the cost of processing long-context sequences.

Despite these aggressive compression strategies, our distilled approach preserves a high degree of representational power, leading to substantially improved accuracy compared to prior transformer binarization methods. We evaluate HAD on a range of tasks and models, including the GLUE benchmark, ImageNet, and QuALITY, demonstrating state-of-the-art performance among binarized Transformers while drastically reducing the computational costs of long-context inference.

We implement HAD in custom hardware simulations, demonstrating superior performance characteristics compared to a custom hardware implementation of standard attention. HAD achieves just $\mathbf{1.78}\%$ performance losses on GLUE compared to $9.08\%$ in state-of-the-art binarization work, and $\mathbf{2.5}\%$ performance losses on ImageNet compared to $12.14\%$, all while targeting custom hardware with a $\mathbf{79}\%$ area reduction and $\mathbf{87}\%$ power reduction compared to its standard attention counterpart.

Paper Structure

This paper contains 22 sections, 19 equations, 5 figures, 3 tables, 1 algorithm.

Figures (5)

  • Figure 1: Runtime analysis of BERT Base over increasing context lengths on an NVIDIA L40 GPU. As context length rises into the thousands, attention begins to dominate the runtime. The top plot shows BERT inference latency with and without attention, and the bottom plot shows the percentage of latency attributable to attention versus all other operations.
  • Figure 2: Binarized attention mechanism in Hamming Attention Distillation (HAD), illustrating the binarization of keys (K) and queries (Q) and subsequent attention operations.
  • Figure 3: Accuracies measured while progressively distilling a full-precision DeiT-T over decreasing $N$ values.
  • Figure 4: Given standard Gaussian inputs, the percentage of the largest softmax outputs required to sum to the threshold probability. In effect, this shows how many elements are required to account for a given share of the probability mass.
  • Figure 5: Comparison of HAD and baseline accuracy across different context lengths, evaluated on QuALITY. The context lengths are powers of 2, and the accuracy is measured for both methods.
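
The quantity described in the Figure 4 caption can be approximated with a short NumPy sketch: draw standard Gaussian logits, apply a softmax, and count how many of the largest outputs are needed before their running sum reaches a threshold. The function name `fraction_needed` and its signature are hypothetical, and the paper's exact experimental setup (sequence lengths, scaling, number of trials) is not reproduced here.

```python
import numpy as np

def fraction_needed(logits, threshold):
    """Smallest fraction of the largest softmax outputs whose sum reaches threshold."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # Running sum of probabilities, largest first.
    cumulative = np.cumsum(np.sort(p)[::-1])
    # First position where the running sum meets or exceeds the threshold.
    count = int(np.searchsorted(cumulative, threshold, side="left")) + 1
    # Guard against rounding when the threshold is at (or above) the total mass.
    return min(count, p.size) / p.size

rng = np.random.default_rng(0)
logits = rng.standard_normal(4096)  # standard Gaussian inputs, as in the caption
for t in (0.5, 0.9, 0.99):
    print(t, fraction_needed(logits, t))
```

A small fraction at a high threshold means that most of the probability mass sits in a few entries, which is the intuition behind keeping only the top-$N$ attention scores.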