Blending Complementary Memory Systems in Hybrid Quadratic-Linear Transformers

Kazuki Irie, Morris Yau, Samuel J. Gershman

TL;DR

This work investigates integrating two memory paradigms, KV-memory with softmax attention (quadratic transformers) and FW-memory with linear attention (DeltaNet/linear transformers), to create Hybrid Quadratic-Linear Transformers (HQLTs) for general sequence processing. It introduces three blending schemes (Delayed-Streaming, Delayed-Chunk, and Synchronous) and systematically evaluates them on large-scale language modeling, expressivity benchmarks, in-context retrieval, and reinforcement learning in POMDPs. Across tasks, the Synchronous HQLT consistently yields the strongest overall performance, leveraging simultaneous processing in KV- and FW-memory to combine precise recall with expressive computation. The results provide a principled view on designing neural memory systems: a carefully synchronized hybrid can overcome the limitations of its individual components, which sheds light on memory design for future architectures.

Abstract

We develop hybrid memory architectures for general-purpose sequence processing neural networks that combine key-value memory using softmax attention (KV-memory) with fast weight memory through dynamic synaptic modulation (FW-memory), the core principles of quadratic and linear transformers, respectively. These two memory systems have complementary but individually limited properties: KV-memory offers precise retrieval but is constrained by quadratic complexity in sequence length, while FW-memory supports arbitrarily long sequences and enables more expressive computation but sacrifices precise recall. We propose and compare three methods to blend these two systems into a single memory system, differing in how and when input information is delivered to each system, to leverage the strengths of both. We conduct experiments on general language modeling and retrieval tasks by training 340M- and 1.3B-parameter models from scratch, as well as on synthetic algorithmic tasks designed to precisely illustrate the benefits of certain hybrid methods over others. We also evaluate our hybrid memory systems on reinforcement learning in partially observable environments. Overall, we demonstrate how a well-designed hybrid can overcome the limitations of its individual components, offering new insights into the design principles of neural memory systems.
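
The contrast between the two memory systems described in the abstract can be made concrete with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function names (`kv_memory_read`, `fw_memory_write`, `fw_memory_read`), the unit-norm key, and the fixed scalar learning rate `beta` are all hypothetical choices. The delta-rule update is used for FW-memory because the TL;DR names DeltaNet as the linear-attention component.

```python
import numpy as np

def kv_memory_read(keys, values, query):
    """Softmax attention readout over every stored key/value pair.

    Cost per read grows with the number of stored pairs, which is what
    makes the total cost quadratic in sequence length.
    """
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values

def fw_memory_write(W, key, value, beta=1.0):
    """Delta-rule fast weight update on a fixed-size matrix W (d_v x d_k).

    Assumption: the key is normalised to unit length and beta is a fixed
    scalar; the paper uses a dynamic learning rate beta_t instead.
    """
    key = key / np.linalg.norm(key)
    prediction = W @ key
    return W + beta * np.outer(value - prediction, key)

def fw_memory_read(W, query):
    """Linear readout: the state size does not depend on sequence length."""
    return W @ query

rng = np.random.default_rng(0)
d_k, d_v, T = 8, 4, 32
keys = rng.normal(size=(T, d_k))
values = rng.normal(size=(T, d_v))

W = np.zeros((d_v, d_k))
for k, v in zip(keys, values):
    W = fw_memory_write(W, k, v)

kv_state_size = keys.size + values.size  # grows linearly with T
fw_state_size = W.size                   # constant in T
kv_out = kv_memory_read(keys, values, keys[0])
last_key = keys[-1] / np.linalg.norm(keys[-1])
fw_out = fw_memory_read(W, last_key)
```

In this sketch the KV-memory state grows with every token, while the FW-memory state stays at `d_v * d_k` entries regardless of `T`. With `beta=1.0` and a unit-norm key, the most recently written value is recovered exactly, but earlier associations can be partially overwritten, which is one way to picture the loss of precise recall.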

Paper Structure

This paper contains 34 sections, 10 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: An illustration for Hybrid Quadratic-Linear Transformers (HQLTs). Two variations are shown. In the "Delayed-Stream" variant (A), the newly generated key/value pair is only fed to the key-value memory (KV-memory) system, and the old key/value pair that falls outside the context window of KV-memory is fed to the fast weight memory (FW-memory) system. In the "Synchronous" variant (B), the key/value pair generated at time step $t$ is fed to both KV-memory and FW-memory systems. The dynamic learning rate variable $\beta_t$ and memory mixing variable $\gamma_t$ are omitted.
  • Figure 2: Evaluation of the Synchronous Hybrid Quadratic-Linear Transformer (HQLT) on reinforcement learning in partially observable environments, using a "passive visual match" task (Hung et al., 2019; Ni et al., 2023). A: In this task, an agent (the beige pixel) navigates in a 2D grid world (of size $7 \times 11$) delimited by impermeable walls (black). The agent can only observe the nearby pixels ($5 \times 5$-grid centered on the agent; illustrated by the light-blue boxes). An episode in this task has three phases. During Phase 1 (whose duration is 15 time steps), the agent observes a color, randomly drawn from three choices, red, green or blue (here blue). In Phase 2 (750 steps), the agent is in a room with apples (green); collecting an apple yields a reward of 1. There are initially 10 apples, and they reappear every 20 steps; their positions are random. In Phase 3 (max. 15 steps), if the agent reaches the pixel with the color that matches the one provided in Phase 1, the episode ends successfully; it yields a reward of 100. Alternatively, Phase 3 terminates if the agent reaches a pixel with the wrong color or when the limit of 15 steps is reached (no reward is given in these cases). B and C show the average return and success rate over 20 test episodes as a function of training environment steps, respectively. We show the average and 95% confidence intervals computed using three training seeds. The variation in success rate is high for Transformer, as one of the seeds consistently achieved a 100% success rate after a certain number of training steps, while the other seeds did not. Similarly for HQLT, one of the seeds consistently achieved above 70%.
  • Figure 3: An illustration for the Chunk-Delayed variant of Hybrid Quadratic-Linear Transformers (HQLTs). Here the chunk size is 4. KV-memory only has access to the tokens within the current chunk (i.e., the last three tokens here), whereas FW-memory does not contain any information about the current chunk.
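
The routing difference between the two variants in Figure 1 can be sketched in a few lines of NumPy. This is a hypothetical illustration, not the paper's code: the class `HybridMemory`, the helper `run`, the fixed scalars `beta` and `gamma` (standing in for the omitted dynamic variables $\beta_t$ and $\gamma_t$), the convex mixing of the two readouts, and the write-before-read ordering are all assumptions made for this example.

```python
from collections import deque
import numpy as np

class HybridMemory:
    """Toy hybrid of a sliding-window KV-memory and a delta-rule FW-memory."""

    def __init__(self, d_k, d_v, window, mode):
        assert mode in ("delayed_stream", "synchronous")
        self.window = window
        self.mode = mode
        self.kv = deque()                 # bounded KV-memory window
        self.W = np.zeros((d_v, d_k))     # fixed-size FW-memory state
        self.fw_writes = 0

    def _fw_write(self, k, v, beta):
        k = k / np.linalg.norm(k)         # assumption: unit-norm keys
        self.W += beta * np.outer(v - self.W @ k, k)
        self.fw_writes += 1

    def step(self, q, k, v, beta=0.5, gamma=0.5):
        # The new pair always enters KV-memory.
        self.kv.append((k, v))
        if self.mode == "synchronous":
            # Variant B: the same pair is also written to FW-memory now.
            self._fw_write(k, v, beta)
        if len(self.kv) > self.window:
            old_k, old_v = self.kv.popleft()
            if self.mode == "delayed_stream":
                # Variant A: only the evicted pair reaches FW-memory.
                self._fw_write(old_k, old_v, beta)
        keys = np.stack([pair[0] for pair in self.kv])
        vals = np.stack([pair[1] for pair in self.kv])
        scores = keys @ q / np.sqrt(q.shape[0])
        w = np.exp(scores - scores.max())
        w /= w.sum()
        kv_out = w @ vals
        fw_out = self.W @ q
        # Assumption: a fixed convex mix of the two readouts.
        return gamma * kv_out + (1.0 - gamma) * fw_out

def run(mode, T=10, window=4, d_k=8, d_v=4, seed=0):
    rng = np.random.default_rng(seed)
    mem = HybridMemory(d_k, d_v, window, mode)
    out = None
    for _ in range(T):
        q, k = rng.normal(size=d_k), rng.normal(size=d_k)
        v = rng.normal(size=d_v)
        out = mem.step(q, k, v)
    return mem, out

mem_delayed, out_delayed = run("delayed_stream")
mem_sync, out_sync = run("synchronous")
```

With 10 steps and a window of 4, the delayed variant in this sketch writes to FW-memory only for the 6 evicted pairs, whereas the synchronous variant writes all 10. In the delayed case FW-memory therefore holds nothing about the tokens still inside the KV window.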