Cottention: Linear Transformers With Cosine Attention

Gabriel Mongaras, Trevor Dohm, Eric C. Larson

TL;DR

The results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.

Abstract

Attention mechanisms, particularly softmax attention, have been instrumental in the success of transformer-based models such as GPT. However, the quadratic memory complexity of softmax attention with respect to sequence length poses significant challenges for processing longer sequences. We introduce Cottention, a novel attention mechanism that replaces the softmax operation with cosine similarity. By leveraging the properties of cosine similarity and rearranging the attention equation, Cottention achieves native linear memory complexity with respect to sequence length, making it inherently more memory-efficient than softmax attention. We demonstrate that Cottention can be reformulated as a recurrent neural network (RNN) with a finite hidden state, allowing for constant memory usage during inference. We evaluate Cottention on both the bidirectional BERT and causal GPT tasks, demonstrating comparable performance to softmax attention while significantly reducing memory requirements. To ensure efficient computation, we develop a custom CUDA kernel for Cottention. Our results show that Cottention is a promising alternative to softmax attention, enabling the processing of longer sequences without sacrificing performance, due to its native linear memory complexity and ability to maintain a constant memory footprint during inference.
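The rearrangement mentioned in the abstract can be illustrated with a small sketch. It assumes the simplest bidirectional form: queries and keys are L2-normalized row by row, so their dot products are cosine similarities, and associativity lets the product be grouped as either (Q K^T) V or Q (K^T V). The function names are hypothetical, batch and head dimensions are dropped, and any additional scaling or stabilization the paper may apply is omitted. This is not the authors' CUDA kernel.

```python
import math

def l2_normalize_rows(m):
    """Scale each row to unit L2 norm; all-zero rows are left unchanged."""
    out = []
    for row in m:
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        out.append([x / norm for x in row])
    return out

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cosine_attention_quadratic(q, k, v):
    # Grouping (Q K^T) V: materializes an s x s similarity matrix.
    qn, kn = l2_normalize_rows(q), l2_normalize_rows(k)
    scores = matmul(qn, transpose(kn))
    return matmul(scores, v)

def cosine_attention_linear(q, k, v):
    # Grouping Q (K^T V): the intermediate is only d_key x d_value,
    # so its size does not depend on the sequence length s.
    qn, kn = l2_normalize_rows(q), l2_normalize_rows(k)
    kv = matmul(transpose(kn), v)
    return matmul(qn, kv)

# Toy inputs: sequence length 4, d_key = 3, d_value = 2 (illustrative values).
q = [[1.0, 2.0, 0.5], [0.0, 1.0, -1.0], [3.0, -2.0, 1.0], [0.5, 0.5, 0.5]]
k = [[0.5, -1.0, 2.0], [1.0, 1.0, 0.0], [-2.0, 0.5, 1.0], [0.0, 3.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0], [2.0, -1.0], [0.5, 0.5]]

quadratic = cosine_attention_quadratic(q, k, v)
linear = cosine_attention_linear(q, k, v)
max_gap = max(abs(a - b) for ra, rb in zip(quadratic, linear) for a, b in zip(ra, rb))
print("largest difference between the two groupings:", max_gap)
```

Both groupings compute the same output up to floating-point rounding. The difference is only in the intermediate: the first stores a matrix whose size grows with the square of the sequence length, while the second stores one whose size depends only on the key and value dimensions. Softmax blocks this regrouping because it is applied to the similarity matrix before the product with V.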

Paper Structure (24 sections, 11 equations, 4 figures, 1 table)

Figures (4)

  • Figure 1: Recurrent neural network representation of cosine attention, where the queries, keys, and values have shapes $(N, H, d_{H\_key})$, $(N, H, d_{H\_key})$, and $(N, H, d_{H\_value})$, respectively, and the hidden state has shape $(N, H, d_{H\_value}, d_{H\_key})$. $\otimes$ denotes an outer product, $\odot$ an inner product, and $\oplus$ a position-wise addition. The hidden state $H_0$ is initialized to the zero matrix.
  • Figure 2: Cosine attention has constant memory during inference while softmax attention has a quadratic increase under the naive implementation and linear increase using KV cache. This makes cosine attention more suitable for processing long sequences, especially in scenarios where memory is limited or the sequence length is not known in advance.
  • Figure 3: Perplexity comparison for models with 300M (left) and 1.2B (right) parameters.
  • Figure 5: Time and memory usage comparison between softmax and cosine attention models. Softmax models exhibit quadratic complexity, while cosine models demonstrate linear complexity with respect to sequence length. Interestingly, the memory usage of the cosine attention models doesn't seem to scale quadratically with respect to dimension.
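The recurrent view in Figure 1 and the constant-memory claim in Figure 2 can be sketched for a single head as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the batch and head dimensions are dropped, and any extra normalization or stabilization used in the paper is left out. The hidden state starts at zero, accumulates the outer product of each value with its normalized key, and is multiplied by the normalized query to produce the output.

```python
import math

def unit(vec):
    """Return vec scaled to unit L2 norm; an all-zero vector is left unchanged."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine_attention_recurrent(queries, keys, values):
    """Causal cosine attention processed one token at a time."""
    d_key, d_value = len(keys[0]), len(values[0])
    hidden = [[0.0] * d_key for _ in range(d_value)]  # H_0 = 0, shape (d_value, d_key)
    outputs = []
    for q, k, v in zip(queries, keys, values):
        q_hat, k_hat = unit(q), unit(k)
        # H_t = H_{t-1} + (v_t outer k_hat_t): position-wise add of an outer product.
        for i in range(d_value):
            for j in range(d_key):
                hidden[i][j] += v[i] * k_hat[j]
        # o_t = H_t q_hat_t: inner product of each hidden-state row with the query.
        outputs.append([sum(h * x for h, x in zip(row, q_hat)) for row in hidden])
    return outputs, hidden

def cosine_attention_causal_reference(queries, keys, values):
    """Same result computed directly: o_t = sum over j <= t of cos(q_t, k_j) * v_j."""
    q_hat = [unit(q) for q in queries]
    k_hat = [unit(k) for k in keys]
    d_value = len(values[0])
    outputs = []
    for t, q in enumerate(q_hat):
        out = [0.0] * d_value
        for j in range(t + 1):
            sim = sum(a * b for a, b in zip(q, k_hat[j]))
            for i in range(d_value):
                out[i] += sim * values[j][i]
        outputs.append(out)
    return outputs

# Toy inputs: 5 tokens, d_key = 3, d_value = 2 (illustrative values).
queries = [[1.0, 0.0, 2.0], [0.5, -1.0, 1.0], [2.0, 2.0, 0.0], [0.0, 1.0, 1.0], [-1.0, 0.5, 3.0]]
keys = [[0.0, 1.0, 1.0], [1.0, 1.0, -1.0], [2.0, 0.0, 0.5], [1.0, -2.0, 0.0], [0.5, 0.5, 0.5]]
values = [[1.0, 2.0], [0.0, -1.0], [3.0, 0.5], [-1.0, 1.0], [2.0, 2.0]]

recurrent_out, final_state = cosine_attention_recurrent(queries, keys, values)
reference_out = cosine_attention_causal_reference(queries, keys, values)
print("hidden state shape:", (len(final_state), len(final_state[0])))
```

The only state carried between tokens is the `d_value` by `d_key` matrix, whose size is independent of how many tokens have been processed. A softmax KV cache, by contrast, must retain every past key and value, so its size grows linearly with sequence length.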