Using Fast Weights to Attend to the Recent Past
Jimmy Ba, Geoffrey Hinton, Volodymyr Mnih, Joel Z. Leibo, Catalin Ionescu
TL;DR
The paper addresses a limitation of standard recurrent networks, whose only memories are short-lived hidden activities and slowly learned weights, by introducing a fast associative memory that stores recent hidden states in a decaying fast weight matrix. An inner-loop update attracts the current hidden state toward past states in proportion to their similarity to it, which effectively implements attention to the recent past; layer normalization stabilizes this mechanism. Across tasks (associative retrieval, visual attention with glimpses, facial expression recognition, and memory-based agents), the approach improves performance and learning speed over IRNN/LSTM baselines, especially with small hidden sizes and constrained memory. The work suggests a biologically plausible memory architecture that decouples temporary storage from persistent weights and could inform cognitive modeling and reinforcement learning systems.
Abstract
Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that learn to capture regularities among inputs, outputs and payoffs. There is no good reason for this restriction. Synapses have dynamics at many different time-scales and this suggests that artificial neural networks might benefit from variables that change slower than activities but much faster than the standard weights. These "fast weights" can be used to store temporary memories of the recent past and they provide a neurally plausible way of implementing the type of attention to the past that has recently proved very helpful in sequence-to-sequence models. By using fast weights we can avoid the need to store copies of neural activity patterns.
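The fast-weight mechanism summarized above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the ReLU nonlinearity, the default values of the decay rate `lam` and fast learning rate `eta`, and the slow weight matrices `W` and `C` are all assumptions chosen for illustration. The sketch decays the fast weight matrix, adds the outer product of the current hidden state, and then runs a short inner loop in which the fast weights pull the new state toward recently stored states, with layer normalization applied before the nonlinearity.

```python
import math

def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]

def relu(v):
    return [max(0.0, x) for x in v]

def update_fast_weights(A, h, lam=0.95, eta=0.5):
    """Decay the old fast weights and add the outer product h h^T.

    lam (decay) and eta (fast learning rate) are illustrative defaults.
    """
    n = len(h)
    return [[lam * A[i][j] + eta * h[i] * h[j] for j in range(n)]
            for i in range(n)]

def fast_weights_step(A, h, x, W, C, inner_steps=1, lam=0.95, eta=0.5):
    """One outer time step: store h in A, then settle the next hidden state.

    W and C are hypothetical slow (ordinary) weight matrices for the
    hidden-to-hidden and input-to-hidden connections.
    """
    A = update_fast_weights(A, h, lam, eta)
    # Sustained "boundary condition" from the slow weights.
    boundary = [a + b for a, b in zip(matvec(W, h), matvec(C, x))]
    h_s = relu(boundary)
    for _ in range(inner_steps):
        # The fast weights attract h_s toward recently stored states.
        attract = matvec(A, h_s)
        h_s = relu(layer_norm([b + a for b, a in zip(boundary, attract)]))
    return A, h_s
```

Because the fast weight matrix is a decayed sum of outer products, multiplying it by a query vector gives the same result as summing the stored past states, each weighted by its decay factor and by its dot product with the query. This is why the matrix can stand in for explicitly stored copies of past activity patterns.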
