From Markov to Laplace: How Mamba In-Context Learns Markov Chains

Marco Bondaschi, Nived Rajaraman, Xiuying Wei, Kannan Ramchandran, Razvan Pascanu, Caglar Gulcehre, Michael Gastpar, Ashok Vardhan Makkuva

TL;DR

This paper investigates Mamba's in-context learning (ICL) capabilities using data drawn from random Markov chains. It shows that a single-layer Mamba can efficiently implement the in-context Laplacian smoothing estimator, $\mathbb{P}_{\beta}^{(k)}(x_{t+1}=1 \mid x_1^t) = \frac{n_1+\beta}{n+2\beta}$, where $n$ counts the past occurrences of the current length-$k$ context and $n_1$ counts those occurrences that were followed by a $1$. Convolution is identified as the crucial mechanism enabling this behavior. The authors introduce MambaZero, a simplified model, and prove that for order-1 Markov data there exist parameters that bring the model's predictions arbitrarily close to the Laplacian estimator (KL divergence at most $\epsilon$); they further show that depth and convolution window size determine whether such a representation is possible for higher orders. Beyond Markov data, they show that convolution also matters in natural language tasks (e.g., WikiText-103), where it substantially improves Mamba-2 perplexity relative to variants without convolution.
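To make the estimator above concrete, here is a minimal Python sketch of add-$\beta$ (Laplacian) smoothing for a binary order-$k$ chain. It assumes the usual reading of the formula: $n$ is the number of earlier positions where the current length-$k$ context occurred, and $n_1$ is how many of those were followed by a $1$. The function name `laplace_predict` and its arguments are illustrative and do not come from the paper.

```python
def laplace_predict(x, k=1, beta=1.0):
    """Add-beta estimate of P(x_{t+1} = 1 | x_1..x_t) for a binary order-k chain.

    x    : list of 0/1 symbols observed so far
    k    : assumed Markov order (length of the conditioning context)
    beta : smoothing constant, beta > 0
    """
    if len(x) < k:
        # Not enough symbols to form a context: fall back to the uniform guess.
        return 0.5
    context = tuple(x[len(x) - k:])  # the last k symbols
    n = 0    # times this context appeared earlier with a known successor
    n1 = 0   # times that successor was 1
    for i in range(k, len(x)):
        if tuple(x[i - k:i]) == context:
            n += 1
            n1 += x[i]
    return (n1 + beta) / (n + 2 * beta)


if __name__ == "__main__":
    seq = [0, 1, 0, 1, 1, 0, 1]
    # Context is (1,). It was followed by 0, 1, 0, so n = 3 and n1 = 1.
    print(laplace_predict(seq, k=1, beta=1.0))  # (1 + 1) / (3 + 2) = 0.4
```

With no matching context in the history the estimate reduces to $\beta / 2\beta = 1/2$, which is the smoothing term acting as a uniform prior.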

Abstract

While transformer-based language models have driven the AI revolution thus far, their computational complexity has spurred growing interest in viable alternatives, such as structured state space sequence models (SSMs) and selective SSMs. Among these, Mamba (S6) and its variant Mamba-2 have shown remarkable inference speedups over transformers while achieving comparable or superior performance on complex language modeling tasks. However, despite these architectural innovations and empirical successes, the fundamental learning capabilities of Mamba remain poorly understood. In this paper, we address this gap by studying in-context learning (ICL) on Markov chains and uncovering a surprising phenomenon: unlike transformers, even a single-layer Mamba efficiently learns the in-context Laplacian smoothing estimator, which is both Bayes and minimax optimal, for all Markovian orders. To explain this, we theoretically characterize the representation capacity of Mamba and reveal the fundamental role of convolution in enabling it to represent the optimal Laplacian smoothing. These theoretical insights align strongly with empirical results and, to the best of our knowledge, represent the first formal connection between Mamba and optimal statistical estimators. Finally, we outline promising research directions inspired by these findings.

Paper Structure

This paper contains 23 sections, 4 theorems, 60 equations, 5 figures, 5 tables.

Key Result

Theorem 1

For the canonical $\mathsf{MambaZero}$ model with dimensions $d = N = 2$, $e=1$, and convolution window $w=2$, there is a choice of parameters such that the model prediction is arbitrarily close to the Laplacian estimator for random first-order Markov chains. More formally, for any $\beta > 0$ and $

Figures (5)

  • Figure 1: Single-layer Mamba learns the optimal Laplacian smoothing when trained on random Markov chains, exhibiting in-context learning. A two-layer transformer also learns the same, albeit less precisely. In contrast, a single-layer transformer fails to solve this task. We observe the same phenomenon for various Markov orders.
  • Figure 2: Mamba-based language model with binary input data: for each $t\in [T]$, the next-token prediction probability is $f_{\boldsymbol{\theta}}(x_1^t)= \mathbb{P}_{\boldsymbol{\theta}}\left( x_{t+1}=\cdot \mid x_1^t \right)$.
  • Figure 3: (a) illustrates the fundamental role of convolution, without which the model fails to learn the task. In contrast, a simplified variant with just the convolution ($\mathsf{MambaZero}$) matches the performance of the full model. (b) highlights the relation between the Markov order $k$ and the window size $w$ of $\mathsf{Mamba}$: the model needs $w \geq k+1$ to learn the order-$k$ prediction task.
  • Figure 4: Single-layer Mamba on data generated from the switching Markov process with $p_{\rm switch} = 0.01$. The red vertical lines mark the positions of switch tokens. (a) shows that the model's prediction closely tracks that of the optimal estimator even in this more complex scenario. (b) highlights the model's selectivity: every time a switch token appears, the model erases all information about the past by setting $a_t=0$.
  • Figure 5: Value of $a_t$ across positions at convergence.
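The selectivity behavior described for Figure 4 can be illustrated with a small sketch. It is not the paper's model: it keeps order-1 transition counts in a gated recurrence of the form $h_t = a_t h_{t-1} + u_t$, with $a_t = 0$ at a switch token and $a_t = 1$ otherwise, and then applies add-$\beta$ smoothing to the surviving counts. The encoding of the switch token as the symbol `2`, the function name `predict_with_reset`, and the assumption that counting restarts from scratch after a switch are all hypothetical choices made for illustration.

```python
SWITCH = 2  # hypothetical encoding of the switch token


def predict_with_reset(x, beta=1.0):
    """Add-beta estimate of P(next = 1) from order-1 counts that are
    erased whenever a switch token is seen (gate a_t = 0)."""
    counts = [[0, 0], [0, 0]]  # counts[prev][next] for binary symbols
    prev = None
    for sym in x:
        a_t = 0 if sym == SWITCH else 1
        # Gated state update: a_t = 0 wipes the accumulated counts.
        counts = [[a_t * c for c in row] for row in counts]
        if sym == SWITCH:
            prev = None  # the context is forgotten as well
            continue
        if prev is not None:
            counts[prev][sym] += 1
        prev = sym
    if prev is None:
        return 0.5  # no usable context: uniform guess
    n1 = counts[prev][1]
    n = counts[prev][0] + n1
    return (n1 + beta) / (n + 2 * beta)


if __name__ == "__main__":
    # Before the switch, 1 is always followed by 1; after it, 1 is followed by 0.
    seq = [1, 1, 1, 1, SWITCH, 1, 0, 1]
    print(predict_with_reset(seq))  # uses only post-switch counts: (0 + 1) / (1 + 2)
```

In this example, ignoring the reset would give $(3+1)/(4+2) = 2/3$ because the pre-switch transitions would still be counted; with the reset the estimate is $1/3$, reflecting only the data seen since the switch.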

Theorems & Definitions (7)

  • Theorem 1: $\mathsf{MambaZero}$ represents order-$1$ Laplacian smoothing
  • Conjecture 1
  • Theorem 2
  • Theorem 3
  • proof
  • Lemma 1
  • proof