Maelstrom Networks

Matthew Evanusa; Cornelia Fermüller; Yiannis Aloimonos

TL;DR

The paper addresses the absence of robust working memory in contemporary neural nets and argues that true sequence memory requires a dedicated, dynamically evolving state. It introduces Maelstrom Networks, a modular architecture in which an unlearned recurrent Maelstrom provides memory, while a trainable input network and readout perform learning without backpropagating through the Maelstrom. The approach fuses ideas from control theory, reservoir computing, and deep learning to achieve online, real-time sequence processing and to enable neuromorphic hardware implementations, continual learning, and a potential bridging of AI with neuroscience. This framework aims to give artificial agents a persistent internal state and a sense of self, with practical implications for embodied AI and energy-efficient, scalable learning systems.

Abstract

Artificial neural network research has struggled to devise a way to incorporate working memory into neural networks. While the ``long term'' memory can be seen as the learned weights, working memory likely consists more of dynamical activity, which is missing from feed-forward models. Current state-of-the-art models such as transformers tend to ``solve'' this by ignoring working memory entirely and simply processing the sequence as a single piece of data; however, this means the network cannot process the sequence in an online fashion, and it leads to an immense explosion in memory requirements. Here, inspired by a combination of control theory, reservoir computing, deep learning, and recurrent neural networks, we offer an alternative paradigm that combines the strength of recurrent networks with the pattern-matching capability of feed-forward neural networks, which we call the \textit{Maelstrom Networks} paradigm. This paradigm leaves the recurrent component - the \textit{Maelstrom} - unlearned, and offloads the learning to a powerful feed-forward network. This allows the network to leverage the strength of feed-forward training without unrolling the network, and allows the memory to be implemented in new neuromorphic hardware. It endows a neural network with a sequential memory that takes advantage of the inductive bias that data is organized causally in the temporal domain, and imbues the network with a state that represents the agent's ``self'', moving through the environment. This could also lead the way to continual learning, with the network modularized and ``protected'' from the overwrites that come with new data. In addition to helping solve the performance problems that plague current non-temporal deep networks, this could also finally lead towards endowing artificial networks with a sense of ``self''.

Paper Structure (12 sections, 5 figures)

Introduction
Limitations of Current Approaches
Feed-Forward Networks Lack Sequence Memory
Memory Serves in the Service of Embodiment
Prior Work on Connectionist Memory
Memory in Connectionist Networks
Hopfield Networks
Gradient-Based RNNs
Reservoir Computing
Maelstrom Networks
Advantages of Maelstrom Networks
Conclusion

Figures (5)

Figure 1: As opposed to the current zeitgeist of machine learning, data in the real world follows inductive biases in how the data is structured not only in the spatial domain - of which we have taken heavy account - but also in the temporal one. Left: The current view of machine learning, which sees data as independently and identically distributed (i.i.d.) in time; it is the job of the network to learn the features in time that correspond to each data point. This assumes the inductive bias of the spatial hierarchy of the data, but not the temporo-causal relationships of the data points along the same time thread. Right: The new view of networks as also accounting for the inductive bias of the data along the temporal dimension. The points are related to one another on the same temporal thread via the referential frame of action: it is those actions which took a point from one location to another, as a result of the causal effect of the action. The ability of the network to recognize data along the same thread is what we refer to here as Sequence Memory.
Figure 2: Comparing the various prior approaches to dealing with temporal sequential memory (or lack thereof). Blue rectangles indicate a feed-forward network layer (no memory), while green rectangles represent a recurrently connected layer or memory that is accessible across timesteps. Black arrows represent a learnable weight, whereas red arrows indicate unlearned weights, where the gradients do not pass backwards. From left to right: Transformers or CNNs (lecun1998gradient, devlin2018bert, vaswani2017attention) are in their vanilla state purely feed-forward networks, and are not represented here because they do not contain any form of sequence memory. Only newer variants of Transformers such as Transformer-XL, which contain a "cache" that is accessible to later timesteps, have what we would consider sequence memory. However, these are not "true" sequence memory as they do not solve the continual learning problem, due to the fact that they are still non-modular hadsell2020embracing (i.e., the memory component is still "hooked" to the motor networks) and thus any new gradients would overwrite previous timesteps. Reservoir networks (echo state networks or liquid state machines) have a recurrent component with unlearned recurrent weights and input, but a learned readout that can be a feed-forward network evanusa2023t. RNNs and LSTMs have a recurrent component where every connection is learnable. Both LSTMs and Reservoirs can have multiple recurrent "layers", connected in a hierarchical structure gallicchio2017deep. Lastly, Hopfield networks hopfield2007hopfield and Self Organizing Maps kohonen1990self are recurrent components without a learned readout, where the recurrent weights are trained using an unsupervised self-organizing rule.
Figure 3: The topological organization of the nervous system as proposed by Frank Rosenblatt in 1961, taken from rosenblatt1961principles. This bears a strong resemblance to the Maelstrom Paradigm: The sensory tracts and memory represent the input function, the motor tracts represent the output function, and the integration network is the Maelstrom. The only difference, and the only thing lacking from Rosenblatt's account in our opinion, is the notion of sequence memory. The Maelstrom can be seen as an implementation of Rosenblatt's ideas, in combination with deep learning as well as notions of state and sequence memory.
Figure 4: The Maelstrom Network paradigm. The input is passed through an input network, a feed-forward neural network which maps input patterns to control actions on the maelstrom. This is then passed to an interface, which serves as the hub for talking to and from the maelstrom. The interface passes to the maelstrom, a recurrently connected state space that collects and conglomerates actions from the controller. The maelstrom bounces and maintains a state of the previous input. As the maelstrom is recurrent and unlearned, or unhooked, from the gradient of the output, it exhibits chaotic behavior - it is the job of the input network to control this activity. The interface then reads the maelstrom and passes this to an output function, which then produces an output. For neural network approaches, the input function, output function, and interface are all multilayered neural networks. Critically, when learning, the gradients do not flow through the maelstrom; letting them do so would entail the exhaustive unrolling of the network to compute accurate gradients, and is highly biologically unlikely. In contrast, the maelstrom does not need to unroll, which also makes it more attractive as a biological model that is able to account for the randomness of connections while preserving computational power. Black lines indicate connections where error can back-propagate and induce learning; red lines indicate connections that do not allow error to back-propagate. This also allows for a "skip" connection between the interface components to assist in gradient propagation (dotted line).
Figure 5: The proposed relationship of the Maelstrom paradigm with functional modules in the human brain. One of the main thrusts of the Maelstrom paradigm was to create a model that advances both artificial intelligence and the understanding of the human brain simultaneously; this is the ideal goal of artificial intelligence research. The stimulus enters the brain through the sensory cortices, such as the visual and auditory cortex. These cortices contain top-down feedback control (which is exemplified by the feedback from the maelstrom to the input controller), but are seen as functionally feed-forward (i.e., the gradients do not pass recurrently). These sensory cortices map to the input network of the Maelstrom. These features are passed to the executive modules, whose location is a large open question in neuroscience, but which we suggest likely exist in the hub regions that control, receive from, and regulate multiple regions. In the Maelstrom Paradigm, the executive is rolled into the input controller network; however, we envision a separate executive module, distinct from the sensory network, as future work. This network then sends data to the Maelstrom, where it bounces around recurrently in a "storm" of chaotic activity; in our view, this corresponds to a mix of the prefrontal cortex and the hippocampus. Also left for future work is consolidation of memory, or learning of memory features, in the maelstrom, as it is left completely untrained in the current simpler iteration. Lastly, the output is sent (either from basal ganglia-learned actions, explicit control, or reflexes), through the executive, to the cerebellum where it learns the correct weights for mapping these to motor actions.
The positive feedback control from the maelstrom back to the controller is represented in the maelstrom network via the feedback connection from the maelstrom to the input network; this creates a larger meta-loop within the system at a level above the loops within the maelstrom, and contributes as well to the phenomenon of "self" of the system.
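
The data flow described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the class name `MaelstromSketch`, the tanh state update, the weight scales, the delta-rule readout update, and the toy delayed-sine task are all assumptions chosen for brevity. The input network is a single fixed layer here (in the paradigm it would be trained), and the interface and the feedback connection from the maelstrom to the input network are omitted. The sketch only shows that the recurrent weights are never updated and that learning uses the current state as a constant input, with no unrolling through time.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, scale, rng):
    """Matrix with entries drawn uniformly from [-scale, scale]."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MaelstromSketch:
    """Hypothetical toy model: input layer -> fixed recurrent state -> readout."""

    def __init__(self, n_in, n_state, n_out, seed=0):
        rng = random.Random(seed)
        # Input network (assumption: one layer, left fixed in this sketch).
        self.W_in = rand_matrix(n_state, n_in, 0.5, rng)
        # Maelstrom: random recurrent weights that are never trained.
        self.W_rec = rand_matrix(n_state, n_state, 0.9 / n_state, rng)
        # Output function (assumption: a linear readout, the only learned part here).
        self.W_out = [[0.0] * n_state for _ in range(n_out)]
        self.state = [0.0] * n_state

    def step(self, x):
        """Advance the state by one timestep and return the readout."""
        control = [math.tanh(v) for v in matvec(self.W_in, x)]
        recur = matvec(self.W_rec, self.state)
        self.state = [math.tanh(c + r) for c, r in zip(control, recur)]
        return matvec(self.W_out, self.state)

    def train_step(self, x, target, lr=0.05):
        """One online update of the readout; the state is treated as a constant."""
        y = self.step(x)
        err = [t - yi for t, yi in zip(target, y)]
        for i, e in enumerate(err):
            for j, s in enumerate(self.state):
                self.W_out[i][j] += lr * e * s  # delta rule, no unrolling in time
        return sum(e * e for e in err)


if __name__ == "__main__":
    # Toy task (assumption): output the previous input of a sine wave.
    net = MaelstromSketch(n_in=1, n_state=8, n_out=1, seed=1)
    for t in range(200):
        loss = net.train_step([math.sin(0.3 * t)], [math.sin(0.3 * (t - 1))])
    print("final squared error:", loss)
```

Because the update touches only `W_out`, each training step costs the same regardless of sequence length, which is the property the caption attributes to keeping the maelstrom out of the gradient path.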