TransformerFAM: Feedback attention is working memory
Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, Pedro Moreno Mengibar
TL;DR
The paper addresses the problem of processing arbitrarily long sequences with Transformers. It introduces TransformerFAM, a memory-augmented architecture that adds a Feedback Attention Memory (FAM) loop to Block Sliding Window Attention. The loop lets the model attend to its own latent representations, which acts as a form of working memory, adds no new trainable weights, and keeps inference cost (near) linear in sequence length. In experiments that LoRA-finetune 1B, 8B, and 24B Flan-PaLM checkpoints, TransformerFAM improves performance on long-context tasks and achieves strong PassKey retrieval results, which suggests that LLMs could handle contexts of unlimited length. The work draws on neuroscience and global workspace theory, describes its memory-management choices (e.g., FAM initialization, position encoding, random state passing), and discusses limitations and directions for future memory enhancements in large-scale models. Overall, TransformerFAM is a step toward integrating working memory into Transformers, with practical implications for long-form reasoning and efficient long-context processing.
Abstract
While Transformers have revolutionized deep learning, their quadratic attention complexity hinders their ability to process infinitely long inputs. We propose Feedback Attention Memory (FAM), a novel Transformer architecture that leverages a feedback loop to enable the network to attend to its own latent representations. This design fosters the emergence of working memory within the Transformer, allowing it to process indefinitely long sequences. TransformerFAM requires no additional weights, enabling seamless integration with pre-trained models. Our experiments show that TransformerFAM significantly improves Transformer performance on long-context tasks across various model sizes (1B, 8B, and 24B). These results showcase the potential to empower Large Language Models (LLMs) to process sequences of unlimited length.
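The feedback loop described in the abstract can be illustrated with a small NumPy sketch. This is a simplified illustration under stated assumptions, not the paper's algorithm: it models a single layer with one attention head, uses identity projections in place of learned query/key/value weights, omits causal masking and position encoding, and keeps only one previous block as the sliding-window memory. The names `attend`, `fam_layer`, `block_size`, and `fam_init` are hypothetical. The point is that each block attends to a fixed-size context (memory state, previous block, current block), and the memory state is then updated by attending to the current block and to itself.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(q, kv):
    # Single-head scaled dot-product attention.
    # Assumption: identity projections, so keys and values are both `kv`.
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

def fam_layer(x, block_size, fam_init):
    """Process x (seq_len, d) block by block with a feedback memory.

    fam_init (fam_len, d) is a hypothetical initial memory state.
    Returns the per-token outputs and the final memory state.
    """
    d = x.shape[1]
    fam = fam_init
    prev_block = np.zeros((0, d))  # no previous block before the first one
    outputs = []
    for start in range(0, len(x), block_size):
        block = x[start:start + block_size]
        # Block tokens see the memory, the previous block, and themselves.
        # The context size is bounded, so per-block cost does not grow
        # with the total sequence length.
        context = np.concatenate([fam, prev_block, block], axis=0)
        outputs.append(attend(block, context))
        # Feedback step: the memory attends to the current block and to
        # its own previous state, then is carried to the next block.
        fam = attend(fam, np.concatenate([fam, block], axis=0))
        prev_block = block
    return np.concatenate(outputs, axis=0), fam

# Usage: 12 tokens of width 4, blocks of 4 tokens, 2 memory slots.
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 4))
out, fam = fam_layer(x, block_size=4, fam_init=np.zeros((2, 4)))
print(out.shape, fam.shape)
```

In this sketch, the third block never attends to the first block directly, so any influence of the first block on its output has to travel through the memory state. That is the sense in which the feedback loop can carry information beyond the attention window.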
