BinaryAttention: One-Bit QK-Attention for Vision and Diffusion Transformers

Chaodong Xiao, Zhengqiang Zhang, Lei Zhang

TL;DR

BinaryAttention is a fast and accurate 1-bit qk-attention method. It keeps only the sign of queries and keys when computing attention and replaces floating-point dot products with bit-wise operations, significantly reducing the computational cost.

Abstract

Transformers have achieved widespread and remarkable success, but the computational complexity of their attention modules remains a major bottleneck for vision tasks. Existing methods mainly employ 8-bit or 4-bit quantization to balance efficiency and accuracy. In this paper, we show with theoretical justification that binarization of attention preserves the essential similarity relationships, and we propose BinaryAttention, an effective method for fast and accurate 1-bit qk-attention. Specifically, we retain only the sign of queries and keys when computing the attention and replace the floating-point dot products with bit-wise operations, significantly reducing the computational cost. We mitigate the inherent information loss of 1-bit quantization by incorporating a learnable bias, and we enable end-to-end acceleration. To maintain the accuracy of attention, we adopt quantization-aware training and self-distillation, mitigating quantization errors while ensuring sign-aligned similarity. BinaryAttention is more than 2x faster than FlashAttention2 on A100 GPUs. Extensive experiments on vision transformer and diffusion transformer benchmarks demonstrate that BinaryAttention matches or even exceeds full-precision attention, validating its effectiveness. Our work provides a highly efficient and effective alternative to full-precision attention, pushing the frontier of low-bit vision and diffusion transformers. Code and models are available at https://github.com/EdwardChasel/BinaryAttention.
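
To make the bit-wise replacement concrete, here is a minimal NumPy sketch of the underlying identity, not the authors' kernel. For two sign vectors of length d, the dot product equals d minus twice the number of positions where the signs differ, so it can be computed with XOR and a bit count. The function names `pack_signs` and `binary_scores` are hypothetical, and the scaling, learnable bias, and value quantization described in the abstract are deliberately left out.

```python
import numpy as np

def pack_signs(x):
    """Pack sign bits of x (shape [n, d]) into uint8 words; bit is 1 where x >= 0."""
    return np.packbits(x >= 0, axis=-1)

def binary_scores(q, k):
    """Compute sign(q) @ sign(k).T using XOR and a bit count (illustrative only).

    Assumes sign(x) = +1 for x >= 0 and -1 otherwise.
    """
    d = q.shape[-1]
    qb, kb = pack_signs(q), pack_signs(k)
    # XOR marks positions where the two sign bits differ. Padding bits added by
    # packbits are 0 in both operands, so they never count as mismatches.
    xor = np.bitwise_xor(qb[:, None, :], kb[None, :, :])
    mismatches = np.unpackbits(xor, axis=-1).sum(axis=-1).astype(np.int64)
    # matches - mismatches = (d - mismatches) - mismatches
    return d - 2 * mismatches

rng = np.random.default_rng(0)
q = rng.standard_normal((5, 20))
k = rng.standard_normal((7, 20))
reference = np.where(q >= 0, 1, -1) @ np.where(k >= 0, 1, -1).T
print(np.array_equal(binary_scores(q, k), reference))
```

On a GPU the same identity would typically map to hardware XOR and population-count instructions on packed words; this sketch only checks the arithmetic against a plain integer matrix product.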

Paper Structure

This paper contains 19 sections, 1 theorem, 18 equations, 6 figures, 9 tables, and 1 algorithm.

Key Result

Theorem 1

Consider two random variables $\bm{q},\bm{k}\in \mathbb{R}^d$. Suppose that $\bm{z}=(\bm{q}^T,\bm{k}^T)^T\in \mathbb{R}^{2d}$ is a zero-mean Gaussian vector with covariance matrix $\bm{\Sigma}$, where $\bm{\Sigma}=\begin{bmatrix} \bm{\Sigma}_{qq} & \bm{\Sigma}_{qk} \\ \bm{\Sigma}_{kq} & \bm{\Sigma}_{kk} \end{bmatrix}$. Denote $\bm{D}_q=\mathrm{diag}(\bm{\Sigma}_{qq}),\bm{D}_k=\mathrm{diag}(\bm{\Sigma}_{kk})$ ...
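
The theorem statement is cut off here, so the sketch below does not reproduce it. As related background only, a classical identity for two zero-mean, unit-variance jointly Gaussian scalars with correlation $\rho$ states that $\mathbb{E}[\mathrm{sign}(x)\,\mathrm{sign}(y)] = \frac{2}{\pi}\arcsin(\rho)$, which is monotone in $\rho$ and gives some intuition for why signs can preserve similarity ordering. The Monte Carlo check below uses the hypothetical helper names `sign_correlation` and `arcsine_prediction`.

```python
import numpy as np

def sign_correlation(rho, n=200_000, seed=0):
    """Monte Carlo estimate of E[sign(x) * sign(y)] for correlated Gaussians."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    noise = rng.standard_normal(n)
    # y is standard normal with correlation rho to x.
    y = rho * x + np.sqrt(1.0 - rho**2) * noise
    return float(np.mean(np.sign(x) * np.sign(y)))

def arcsine_prediction(rho):
    """Closed-form value (2 / pi) * arcsin(rho) from the classical identity."""
    return float(2.0 / np.pi * np.arcsin(rho))

for rho in (-0.8, 0.0, 0.5, 0.9):
    print(rho, round(sign_correlation(rho), 3), round(arcsine_prediction(rho), 3))
```

The estimate and the closed form should agree up to sampling noise, and both increase with $\rho$.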

Figures (6)

  • Figure 1: Top: Performance comparison between FlashAttention2 and BinaryAttention on vision tasks. Bottom: Image generation examples by DiT-XL/2 (Peebles and Xie, 2023) driven by BinaryAttention.
  • Figure 2: Overview and comparative analysis of BinaryAttention. (a) The computation of BinaryAttention involves three components: converting queries and keys into scaled binary representations, applying a bias enhancement, and quantizing the attention coefficients and values. Sub-figures (b) and (c) show the attention maps (top) and the corresponding activation maps (bottom) for Standard Attention and our BinaryAttention, demonstrating the comparable expressivity of BinaryAttention to Standard Attention despite 1-bit quantization.
  • Figure 3: Kernel speed comparison on A100 GPUs.
  • Figure 4: End-to-end throughput and speedup comparisons on A100 GPUs. ViT (Dosovitskiy et al., 2021) models are used.
  • Figure 5: Qualitative comparison of images generated by DiT-XL/2 (cfg=1.50) using FlashAttention2 and BinaryAttention.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Theorem 1
  • proof
  • proof