Gated recurrent neural networks discover attention
Nicolas Zucchet, Seijin Kobayashi, Yassir Akram, Johannes von Oswald, Maxime Larcher, Angelika Steger, João Sacramento
TL;DR
This paper demonstrates that gated recurrent neural networks with linear diagonal recurrence and multiplicative gates can exactly implement linear self-attention, revealing a structural bridge between RNNs and Transformer-style attention. Through a formal constructive argument, it shows how such a network can store past inputs and process them to reproduce the attention operation with a finite, if large, number of neurons, and it analyzes the parameter efficiency and invariances of the construction. Empirically, in teacher-student experiments, trained gated RNNs learn attention-like solutions, and on in-context learning tasks they discover gradient-descent-like algorithms akin to those used by linear self-attention. The findings suggest that attention-like computation can be encoded inside RNNs, offering insights for architecture design, potential compressions, and connections to neuroscience, while also recognizing practical limits due to parameter counts and nonlinearity effects.
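To make the claimed equivalence concrete, the following is a minimal sketch, not necessarily the paper's exact construction. It assumes causal linear self-attention of the form y_t = sum over s <= t of (k_s . q_t) v_s, and a recurrent layer with one hidden unit per (value, key) coordinate pair, a recurrence weight of 1, a multiplicative input gate, and a multiplicative output gate. The function names and the toy inputs are hypothetical.

```python
def linear_self_attention(queries, keys, values):
    """Causal linear self-attention: y_t = sum_{s<=t} (k_s . q_t) * v_s."""
    outputs = []
    for t, q in enumerate(queries):
        y = [0.0] * len(values[0])
        for s in range(t + 1):
            score = sum(k_j * q_j for k_j, q_j in zip(keys[s], q))
            for i, v_i in enumerate(values[s]):
                y[i] += score * v_i
        outputs.append(y)
    return outputs


def gated_linear_rnn(queries, keys, values):
    """Gated RNN sketch with d_v * d_k hidden units and a diagonal linear recurrence."""
    d_v, d_k = len(values[0]), len(keys[0])
    # Hidden state: one unit per (value coordinate, key coordinate) pair.
    h = [[0.0] * d_k for _ in range(d_v)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        for i in range(d_v):
            for j in range(d_k):
                # Diagonal recurrence with weight 1 (assumed): each unit only
                # reads its own past value. The input is a product of two
                # input features, which is the multiplicative input gate.
                h[i][j] = 1.0 * h[i][j] + v[i] * k[j]
        # Output gate: multiply each hidden unit by a query coordinate,
        # then sum linearly over the key index.
        outputs.append([sum(h[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return outputs


if __name__ == "__main__":
    # Toy inputs (hypothetical): 2 time steps, key/query dimension 2, value dimension 1.
    Q = [[1.0, 0.0], [0.0, 1.0]]
    K = [[1.0, 2.0], [3.0, 4.0]]
    V = [[1.0], [2.0]]
    print(linear_self_attention(Q, K, V))
    print(gated_linear_rnn(Q, K, V))
```

In this sketch the hidden state grows with the product of the key and value dimensions, which is one way to read the "finite, if large, number of neurons" remark above.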
Abstract
Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
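As an illustration of what an attention-based in-context learning algorithm can look like, the sketch below uses a standard identity: one gradient-descent step on in-context linear regression, started from zero weights, gives the same prediction as a linear attention readout. The task setup (scalar targets, squared loss, a single step, learning rate `lr`) and all function names are assumptions made for this example; this section does not specify the tasks the paper uses.

```python
def one_gd_step_prediction(xs, ys, x_query, lr):
    """Predict w . x_query after one gradient step from w = 0
    on the loss 0.5 * sum_i (w . x_i - y_i)^2."""
    d = len(x_query)
    # The gradient at w = 0 is -sum_i y_i * x_i, so w_1 = lr * sum_i y_i * x_i.
    w = [lr * sum(y * x[j] for x, y in zip(xs, ys)) for j in range(d)]
    return sum(w_j * q_j for w_j, q_j in zip(w, x_query))


def linear_attention_prediction(xs, ys, x_query, lr):
    """Same prediction written as linear attention:
    keys x_i, values lr * y_i, query x_query."""
    return sum(
        (lr * y) * sum(x_j * q_j for x_j, q_j in zip(x, x_query))
        for x, y in zip(xs, ys)
    )


if __name__ == "__main__":
    # Toy in-context examples (hypothetical).
    xs = [[1.0, 2.0], [0.5, -1.0]]
    ys = [3.0, -2.0]
    x_query = [2.0, 1.0]
    print(one_gd_step_prediction(xs, ys, x_query, lr=0.1))
    print(linear_attention_prediction(xs, ys, x_query, lr=0.1))
```

The two functions compute the same sum in a different order, so their outputs agree up to floating-point rounding.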
