Arcee: Differentiable Recurrent State Chain for Generative Vision Modeling with Mamba SSMs
Jitesh Chavan, Rohit Lal, Anand Kamat, Mengjia Xu
TL;DR
Arcee introduces a cross-block recurrent state chain for Mamba-based vision models that reuses each block's terminal state-space representation (SSR) in the next block, enabling end-to-end gradient flow with zero additional parameters. The method augments the conventional selective scan by feeding the terminal state $h_T$ of block $l-1$ into block $l$ as its initial state through a differentiable boundary map, creating a mild, architecture-agnostic inductive bias. Empirically, Arcee substantially improves unconditional image generation on CelebA-HQ with Flow Matching: FID drops from $82.81$ to $15.33$ on a single-scan-order Zigzag Mamba baseline, and other backbones also show consistent gains at negligible overhead. The results support Arcee as a plug-and-play enhancement for non-sequential signals, with a theoretical framing built on cross-block Jacobians and a low-rank cross-block influence that arises from the SSR bottleneck. The work points to broad applicability across modalities and scan orders, and suggests two directions: conditioning selective-scan dynamics on cross-modal priors, and extending the approach to video and audio.
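To make the state handoff concrete, the following is a sketch under stated assumptions, not the paper's exact formulation. Assume each block $l$ runs a linear recurrence with input-dependent coefficients $\bar{A}_t^{(l)}$ and $\bar{B}_t^{(l)}$ (notation introduced here for illustration), and write $\phi$ for the boundary map, whose specific form is not given in this summary:

$$
h_t^{(l)} = \bar{A}_t^{(l)}\, h_{t-1}^{(l)} + \bar{B}_t^{(l)}\, x_t^{(l)}, \qquad
h_0^{(l)} =
\begin{cases}
0 & \text{(conventional reset)} \\
\phi\big(h_T^{(l-1)}\big) & \text{(cross-block chain)}
\end{cases}
$$

Unrolling the recurrence over $t = 1, \dots, T$ gives

$$
h_T^{(l)} = \Big(\prod_{t=1}^{T} \bar{A}_t^{(l)}\Big) h_0^{(l)} + \sum_{t=1}^{T} \Big(\prod_{s=t+1}^{T} \bar{A}_s^{(l)}\Big) \bar{B}_t^{(l)}\, x_t^{(l)},
$$

where the products commute when $\bar{A}_t^{(l)}$ is diagonal, as is typical for Mamba. Holding the block's inputs and coefficients fixed, the state-path Jacobian across one boundary is therefore

$$
\frac{\partial h_T^{(l)}}{\partial h_T^{(l-1)}} = \Big(\prod_{t=1}^{T} \bar{A}_t^{(l)}\Big)\, J_\phi\big(h_T^{(l-1)}\big).
$$

With a zero reset this term vanishes, so no gradient passes between blocks along the state path. With the chain it is generally nonzero.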
Abstract
State-space models (SSMs), Mamba in particular, are increasingly adopted for long-context sequence modeling, providing linear-time aggregation via an input-dependent, causal selective-scan operation. Along this line, recent "Mamba-for-vision" variants largely explore multiple scan orders to relax strict causality for non-sequential signals (e.g., images). Rather than preserving cross-block memory, the conventional formulation of the selective-scan operation in Mamba reinitializes each block's state-space dynamics from zero, discarding the terminal state-space representation (SSR) from the previous block. We present Arcee, a cross-block recurrent state chain that reuses each block's terminal SSR as the initial condition for the next block. The handoff across blocks is constructed as a differentiable boundary map whose Jacobian enables end-to-end gradient flow across terminal boundaries. Key to practicality, Arcee is compatible with all prior "vision-mamba" variants, is parameter-free, and incurs constant, negligible cost. From a modeling perspective, we view the terminal SSR as a mild directional prior induced by a causal pass over the input, rather than as an estimator of the non-sequential signal itself. To quantify the impact: for unconditional generation on CelebA-HQ (256$\times$256) with Flow Matching, Arcee reduces FID$\downarrow$ from $82.81$ to $15.33$ ($5.4\times$ lower) on a single-scan-order Zigzag Mamba baseline. Efficient CUDA kernels and training code will be released to support rigorous and reproducible research.
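The difference between resetting and chaining block states can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a single-channel diagonal linear scan, fixed per-step coefficients in place of Mamba's input-dependent ones, an identity boundary map, and a plain residual connection between blocks. The names `selective_scan` and `run_blocks` are hypothetical.

```python
import numpy as np

def selective_scan(x, A, B, C, h0=None):
    """Toy diagonal scan: h_t = A_t * h_{t-1} + B_t * x_t, y_t = <C_t, h_t>.

    x: (T,) inputs; A, B, C: (T, N) per-step coefficients (assumed given).
    Returns the outputs y (T,) and the terminal state h_T (N,).
    """
    T, N = A.shape
    h = np.zeros(N) if h0 is None else np.array(h0, dtype=float)
    y = np.empty(T)
    for t in range(T):
        h = A[t] * h + B[t] * x[t]   # elementwise update (diagonal A)
        y[t] = C[t] @ h
    return y, h

def run_blocks(x, params, chain=True):
    """Stack blocks; with chain=True each block starts from the previous h_T."""
    h = None
    for A, B, C in params:
        # Identity boundary map (an assumption): pass h_T through unchanged.
        y, h = selective_scan(x, A, B, C, h0=h if chain else None)
        x = x + y                    # residual connection (assumed)
    return x, h

rng = np.random.default_rng(0)
T, N, L = 8, 4, 3
x = rng.normal(size=T)
params = [(rng.uniform(0.5, 0.99, size=(T, N)),
           rng.normal(size=(T, N)),
           rng.normal(size=(T, N))) for _ in range(L)]

out_chain, _ = run_blocks(x, params, chain=True)
out_reset, _ = run_blocks(x, params, chain=False)

# State-path sensitivity of one block: h_T(h0) - h_T(0) = prod_t(A_t) * h0.
A, B, C = params[0]
h0 = rng.normal(size=N)
_, hT_from_h0 = selective_scan(x, A, B, C, h0=h0)
_, hT_from_zero = selective_scan(x, A, B, C)
state_gain = np.prod(A, axis=0)
```

In this toy setting the chain adds no parameters: the only change is which initial state each block receives. The vector `state_gain` is the diagonal of the state-path Jacobian of one block's terminal state with respect to its initial state, which is the quantity that lets gradients cross block boundaries.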
