Mimetic Initialization Helps State Space Models Learn to Recall

Asher Trockman, Hrayr Harutyunyan, J. Zico Kolter, Sanjiv Kumar, Srinadh Bhojanapalli

TL;DR

This work proposes a structured initialization technique that allows state space layers to more readily mimic attention and makes it substantially easier for Mamba to learn to copy and do associative recall from scratch.

Abstract

Recent work has shown that state space models such as Mamba are significantly worse than Transformers on recall-based tasks because their state size is constant with respect to input sequence length. In practice, however, state space models have fairly large state sizes, and we conjecture that they should be able to perform much better at these tasks than previously reported. We investigate whether their poor copying and recall performance could be due in part to training difficulties rather than fundamental capacity constraints. Based on observations of their "attention" maps, we propose a structured initialization technique that allows state space layers to more readily mimic attention. Across a variety of architecture settings, our initialization makes it substantially easier for Mamba to learn to copy and do associative recall from scratch.

Paper Structure

This paper contains 27 sections, 10 equations, 11 figures, and 1 table.

Figures (11)

  • Figure 1: Mambas initialized with our technique learn to copy more effectively than those with default initialization. We see evidence of copying ability in the Mamba attention maps; see Layer 1.
  • Figure 2: A hybrid Mamba architecture with one Self-Attention layer easily learns to copy. Dotted lines: performance on training length (50), solid: $2\times$ length generalization (100).
  • Figure 3: Testing the four components of our initialization on Mamba 1 & 2 for 10 seeds.
  • Figure 4: On several tasks, mimetic-initialized Mamba layers naturally learn operations similar to those of Self-Attention layers in the same location, with no additional supervision. Dotted lines: accuracy at training length (50), solid lines: generalizing to length 100.
  • Figure 5: Simple linear attention underperforms Mamba even for very high head dimension, especially at generalization. Dotted lines: accuracy at length 100, solid: at length 200; train length: 50.
  • ...and 6 more figures