TransformerFAM: Feedback attention is working memory

Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar

TL;DR

The paper addresses the problem of processing arbitrarily long sequences with Transformers. It introduces TransformerFAM, an architecture that adds a Feedback Attention Memory (FAM) loop to Block Sliding Window Attention (BSWA). The loop lets the model attend to its own latent representations, which acts as a form of working memory, adds no new trainable weights, and keeps inference cost (near) linear in sequence length. In experiments that fine-tune 1B, 8B, and 24B Flan-PaLM checkpoints with LoRA, TransformerFAM improves performance on long-context tasks and solves PassKey retrieval, which suggests that LLMs could handle contexts of unlimited length. The work draws on neuroscience and global workspace theory, describes the memory-management details it relies on (FAM initialization, position encoding, and random state passing), and discusses limitations and directions for improving memory in large-scale models. Overall, it is a step toward integrating working memory into Transformers, with practical implications for long-form reasoning and efficient long-context processing.

Abstract

While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.
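
The feedback loop described in the abstract can be illustrated with a toy, single-layer, single-head sketch in NumPy. It follows the Figure 2 caption: input queries attend to the past FAM and the current block, and the FAM query (copied from the previous FAM) compresses the current block to produce the next FAM. Everything else here is an assumption made for illustration and is not taken from the paper: the names `attend` and `fam_forward` are hypothetical, there are no learned projections, no memory segments are used, and letting the FAM query also see the previous FAM is a choice of this sketch.

```python
import numpy as np


def attend(query, keys_values, mask=None):
    """Single-head dot-product attention with no learned projections (toy)."""
    scores = query @ keys_values.T / np.sqrt(query.shape[-1])
    if mask is not None:
        # Masked-out positions get zero weight after the softmax.
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values


def fam_forward(x, block_size, fam_init):
    """Process x block by block, carrying a fixed-size FAM state forward.

    x:        (seq_len, dim) input activations
    fam_init: (fam_len, dim) initial FAM (hypothetical initialization)
    """
    fam = fam_init
    outputs = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        b, n_fam = block.shape[0], fam.shape[0]

        # Input queries see all of the previous FAM and, causally,
        # the tokens of the current block.
        causal = np.tril(np.ones((b, b), dtype=bool))
        mask = np.concatenate([np.ones((b, n_fam), dtype=bool), causal], axis=1)
        outputs.append(attend(block, np.concatenate([fam, block]), mask))

        # FAM queries (the previous FAM) compress the current block.
        # Assumption of this sketch: they also attend to the previous FAM.
        fam = attend(fam, np.concatenate([block, fam]))
    return np.concatenate(outputs), fam


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
out, fam = fam_forward(x, block_size=2, fam_init=np.zeros((2, 4)))
print(out.shape, fam.shape)
```

The state carried between blocks has a fixed size (`fam_len` rows) regardless of how long `x` is, which is the property that lets the per-token cost stay constant as the sequence grows.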

Paper Structure

This paper contains 45 sections, 3 equations, 11 figures, 14 tables, 2 algorithms.

Figures (11)

  • Figure 1: Comparison of query-key attention masks for Sliding Window Attention (SWA) variants. (a) Sliding Window Attention: Attention is restricted to the current window $=3$. (b) Block Sliding Window Attention (BSWA) (block size $=2$, memory segment $=1$): Attention is allowed to previous blocks within the memory segment. (c) BSWA (block size $=2$, memory segment $=2$): The memory segment is expanded, allowing attention to a larger past context. (d) Illustrates the receptive field of BSWA (block size $=2$, memory segment $=1$, depth $=4$): The region within the curly braces represents the receptive field.
  • Figure 2: Comparison of attention patterns in a Transformer layer. (a) TransformerBSWA: The input query attends to the current block and two memory segments, which provide past context. (b) TransformerFAM: The input query attends to the current block, memory segments, and past FAM (green lines). The FAM query (copied from the previous FAM, blue dashed arrow) compresses the current block to update FAM. This feedback loop enables information compression and propagation over an indefinite horizon, which is working memory. Figure 4 shows in detail how the dynamic process occurs over time.
  • Figure 3: (a) PassKey Retrieval: Performance across different Transformer models and memory segment configurations. MX denotes the number of BSWA memory segments. FAM represents TransformerFAM with 0 memory segments. TransformerFAM successfully solves the task. (b) LCT: Normalized scores of long-context tasks evaluated by Flan 1B with different Transformer models and different memory segment configurations. FAM outperforms all other BSWA configurations.
  • Figure 4: Visualization of self-attention during inference over time. (A) Self-attention pattern of TransformerBSWA layer with memory segment size of 1. (B) Self-attention pattern with FAM added.
  • Figure 5: PG-19 accuracy for various ablation studies, such as RPO (Random Position Offset), RSP (Random State Passing), and Prefix FAM tuning, across different base frequencies.
  • ...and 6 more figures
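
The attention masks compared in the Figure 1 caption can be sketched in a few lines of NumPy. This is an illustrative reading of the caption, not code from the paper: the function names `swa_mask` and `bswa_mask` are hypothetical, and the sketch assumes that an SWA window of size `w` counts the query position itself and that all masks are causal.

```python
import numpy as np


def swa_mask(seq_len: int, window: int) -> np.ndarray:
    """Causal sliding-window mask: query i sees keys i-window+1 .. i.

    Assumption: the window size includes the query position itself.
    """
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    return (k <= q) & (k > q - window)


def bswa_mask(seq_len: int, block_size: int, memory_segments: int) -> np.ndarray:
    """Causal block sliding-window mask.

    Query i sees the causal part of its own block plus every key in the
    previous `memory_segments` blocks.
    """
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    q_block = q // block_size
    k_block = k // block_size
    return (k <= q) & (k_block >= q_block - memory_segments)


# Rows are queries, columns are keys; 1 means "may attend".
print(swa_mask(6, window=3).astype(int))
print(bswa_mask(6, block_size=2, memory_segments=1).astype(int))
print(bswa_mask(6, block_size=2, memory_segments=2).astype(int))
```

With block size 2 and one memory segment, query 4 (the first token of the third block) may attend to keys 2, 3, and 4: the whole previous block plus itself. Raising `memory_segments` to 2 extends that to keys 0 through 4, which is the widening of past context that panel (c) describes.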