Associative Transformer

Yuwei Sun; Hideya Ochiai; Zhirong Wu; Stephen Lin; Ryota Kanai

TL;DR

The Associative Transformer (AiT) addresses inefficiencies in sparse attention by introducing a Global Workspace Layer that couples a low-rank explicit memory of learnable priors with an associative Hopfield memory for token reconstruction. The method writes token representations into memory via a bottleneck attention with a diversity-promoting balance loss and retrieves refined representations through attractor dynamics, yielding improved parameter efficiency and performance on classification and relational reasoning tasks. Extensive experiments show AiT outperforms state-of-the-art sparse transformers like Coordination while using fewer parameters and layers, with notable gains on CIFAR, ImageNet100, and Sort-of-CLEVR. This approach advances localized contextual learning in vision transformers and suggests broader applicability to other domains with sparse attention constraints.

Abstract

Emerging from the pairwise attention in conventional Transformers, there is a growing interest in sparse attention mechanisms that align more closely with localized, contextual learning in the biological brain. Existing studies such as the Coordination method employ iterative cross-attention mechanisms with a bottleneck to enable the sparse association of inputs. However, these methods are parameter-inefficient and fail in more complex relational reasoning tasks. To this end, we propose the Associative Transformer (AiT) to enhance the association among sparsely attended input tokens, improving parameter efficiency and performance in various vision tasks such as classification and relational reasoning. AiT leverages a learnable explicit memory comprising specialized priors that guide bottleneck attention to facilitate the extraction of diverse localized tokens. Moreover, AiT employs an associative memory-based token reconstruction using a Hopfield energy function. Extensive empirical experiments demonstrate that AiT requires significantly fewer parameters and attention layers while outperforming a broad range of sparse Transformer models. Additionally, AiT outperforms the SOTA sparse Transformer models, including the Coordination method, on the Sort-of-CLEVR dataset.
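
The token reconstruction mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard continuous (modern) Hopfield formulation, where a query $\xi$ is updated as $\xi \leftarrow M^\top \mathrm{softmax}(\beta M \xi)$ and the energy is $E(\xi) = -\mathrm{lse}(\beta, M\xi) + \frac{1}{2}\xi^\top\xi$ up to constants. The exact energy used by AiT is not given in this excerpt, so the function names, the inverse temperature `beta`, the step count, and the toy memory below are all illustrative assumptions.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores, beta=1.0):
    """Numerically stable softmax with inverse temperature beta."""
    scaled = [beta * s for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_retrieve(query, memory, beta=4.0, steps=3):
    """Assumed update rule: xi <- M^T softmax(beta * M xi), repeated `steps` times."""
    xi = list(query)
    for _ in range(steps):
        weights = softmax([dot(slot, xi) for slot in memory], beta)
        xi = [sum(w * slot[d] for w, slot in zip(weights, memory))
              for d in range(len(xi))]
    return xi

def energy(xi, memory, beta=4.0):
    """Assumed energy: -lse(beta, M xi) + 0.5 * ||xi||^2, constants dropped."""
    scores = [beta * dot(slot, xi) for slot in memory]
    top = max(scores)
    lse = (top + math.log(sum(math.exp(s - top) for s in scores))) / beta
    return -lse + 0.5 * dot(xi, xi)

# Toy memory with three stored patterns (hypothetical attractors).
memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy = [0.9, 0.2, -0.1]            # corrupted version of the first pattern
restored = hopfield_retrieve(noisy, memory)
```

In this toy setting the query moves toward the nearest stored pattern and the energy does not increase, which is the attractor behaviour the abstract alludes to. In AiT the stored patterns would be the learned memory contents rather than hand-written vectors.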

Paper Structure (33 sections, 6 equations, 10 figures, 9 tables, 1 algorithm)

Introduction
Related work
Sparse Transformers
Memory mechanisms
Associative Transformer
Vision Transformers for classification tasks
Low-rank priors in explicit memory
Bottleneck attention with a limited capacity
Bottleneck attention balance loss
Information retrieval with associative memory
Attractors
Energy-based retrieval
Experiments
Settings
Datasets
...and 18 more sections

Figures (10)

Figure 1: The scheme of the Associative Transformer (AiT). (a) In a global workspace layer, the input $\mathbb{R}^{B\times N\times E}$ is squashed into vectors $\mathbb{R}^{BN\times E}$. The squashed representations are projected to a latent space of dimension $D \ll E$ and are sparsely selected to update the explicit memory via a fixed bottleneck $k \ll BN$. The Hopfield network utilizes the memory to reconstruct the input tokens, where a learnable linear transformation (LT) scales the memory contents back to the input dimension $E$. (b) The Associative Transformer block consists of self-attention, feed-forward layers, and the global workspace layer. Compared to Vision Transformer (ViT), leveraging the global workspace layer enhances the layer efficiency. A shallower 6-layer AiT is shown to outperform a 12-layer ViT (see Table \ref{tab:class}).
Figure 2: Comparison on the Pet dataset. AiT-Medium demonstrated a stable increase in performance, outperforming ViT-Base.
Figure 3: Model size vs. test accuracy for various model configurations. Consolidating all the components in the Coordination block resulted in the best performance of 71.49% while maintaining a compact model size of 1.0M.
Figure 4: Examples from the Sort-of-CLEVR dataset.
Figure 5: Analysis of operating modes of attention heads in the ViT-Base model. We recognize three different groups of attention heads based on their sparsity scores. Group (I) in light blue: High-sparsity heads, abundant in the middle layers 3-6. The vast majority of these heads used only 50% or fewer of the interactions. Group (II) in orange: Middle-sparsity heads, predominant in layers 2 and 7-10. Less than 80% of the interactions were activated. Group (III) in red: Low-sparsity heads, observed in the high layers 11-12 and the first layer, where most tokens were attended to. The global workspace layer provides the inductive bias to attend to the essential tokens more effectively.
...and 5 more figures
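
The shape flow described in the Figure 1 caption (squash $B\times N\times E$ to $BN\times E$, project to $D$, select $k$ tokens to write into memory, then map memory back to $E$) can be traced with a small sketch. Everything beyond those shapes is an assumption made for illustration: the number of memory slots `M`, the random projection matrices `W_down` and `W_up`, and the selection rule (each slot keeps its `k` highest dot-product scores and averages them with a softmax) are hypothetical and not taken from the paper. The toy sizes also do not satisfy $D \ll E$ or $k \ll BN$ in any meaningful sense.

```python
import math
import random

def matmul(a, b):
    """Multiply an (n x m) matrix by an (m x p) matrix given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def flatten_tokens(batch):
    """Squash B x N x E into BN x E by concatenating the tokens of all samples."""
    return [token for sample in batch for token in sample]

def topk_write(memory, latents, k):
    """Assumed write rule: each slot attends only to its k best-matching tokens."""
    updated = []
    for slot in memory:
        scores = [sum(s * t for s, t in zip(slot, tok)) for tok in latents]
        keep = sorted(range(len(latents)), key=lambda i: scores[i], reverse=True)[:k]
        top = max(scores[i] for i in keep)
        weights = [math.exp(scores[i] - top) for i in keep]
        norm = sum(weights)
        # Softmax-weighted average of the k selected latent tokens.
        updated.append([sum(w / norm * latents[i][d] for w, i in zip(weights, keep))
                        for d in range(len(slot))])
    return updated

def rand_matrix(rows, cols):
    """Random Gaussian matrix standing in for learned parameters."""
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
B, N, E, D, M, k = 2, 4, 8, 3, 2, 3                 # toy sizes, M is hypothetical
batch = [rand_matrix(N, E) for _ in range(B)]       # B x N x E input
W_down = rand_matrix(E, D)                          # projection E -> D
W_up = rand_matrix(D, E)                            # stands in for the caption's LT
memory = rand_matrix(M, D)                          # M priors in the latent space

tokens = flatten_tokens(batch)                      # BN x E
latents = matmul(tokens, W_down)                    # BN x D
memory = topk_write(memory, latents, k)             # M x D, written through the bottleneck
memory_in_E = matmul(memory, W_up)                  # M x E, ready for token reconstruction
```

Because the batch dimension is folded into the token dimension before selection, the `k` tokens written to a slot may come from different samples in the batch, which is what the $k \ll BN$ bottleneck in the caption implies.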
