Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models
Mohammad Shahab Sepehri, Zalan Fabian, Mahdi Soltanolkotabi
TL;DR
Serpent tackles high-resolution image restoration by combining structured state space models (SSMs) with multi-scale, patch-based processing in a U-Net–like architecture. Using selective SSMs with four-direction unrolling, it models long-range dependencies with linear scaling in input size, so large, high-resolution images can be restored at a fraction of the compute and memory of attention-based methods. The approach matches or surpasses state-of-the-art methods such as Restormer and SwinIR on Gaussian deblurring and 8× super-resolution, while delivering up to a 150× reduction in FLOPS and up to 5× lower GPU memory use, with the largest gains at 512×512 resolution. These efficiency gains make high-resolution restoration practical on standard GPUs and suggest a promising direction for scalable dense-vision models built on SSMs, with caveats around software/hardware support and the risk of training data bias.
Abstract
Efficient image restoration architectures are dominated by building blocks that combine convolutional processing with various attention mechanisms. However, convolutional filters, while efficient, are inherently local and therefore struggle to model long-range dependencies in images. In contrast, attention excels at capturing global interactions between arbitrary image regions, but its cost grows quadratically with image dimension. In this work, we propose Serpent, an efficient architecture for high-resolution image restoration that combines recent advances in state space models (SSMs) with multi-scale signal processing in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with favorable linear scaling in input size. We propose a novel hierarchical architecture, inspired by traditional signal processing principles, that converts the input image into a collection of sequences and processes them in a multi-scale fashion. Our experimental results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques while requiring orders of magnitude less compute (up to a $150$-fold reduction in FLOPS) and up to $5\times$ less GPU memory, all while maintaining a compact model size. The efficiency gains achieved by Serpent are especially notable at high image resolutions.
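
To make the idea of converting an image into sequences and scanning them at linear cost more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It unrolls an image along four directions, runs a simple diagonal, time-invariant SSM recurrence $h_t = a \odot h_{t-1} + b\,x_t$, $y_t = c^\top h_t$ over each sequence, and merges the results. The function names (`unroll_four_directions`, `diagonal_ssm_scan`, `roll_back`), the specific four scan orders, the fixed (non-selective) parameters, and the averaging merge are all assumptions made for illustration; Serpent uses selective SSMs whose parameters depend on the input.

```python
import numpy as np

def unroll_four_directions(img):
    """Flatten an (H, W, C) image into four (H*W, C) sequences.

    Assumed scan orders: row-major, reversed row-major,
    column-major, reversed column-major.
    """
    h, w, c = img.shape
    row_major = img.reshape(h * w, c)
    col_major = img.transpose(1, 0, 2).reshape(h * w, c)
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

def diagonal_ssm_scan(x, a, b, c):
    """Run a diagonal linear SSM over a sequence x of shape (L, C).

    a, b, c have shape (N,), where N is the state size. Each channel is
    processed independently with the same parameters. The loop visits
    every position once, so the cost is O(L * N * C), linear in L.
    """
    length, channels = x.shape
    h = np.zeros((a.shape[0], channels))
    y = np.empty_like(x, dtype=float)
    for t in range(length):
        h = a[:, None] * h + b[:, None] * x[t][None, :]  # state update
        y[t] = c @ h                                     # readout
    return y

def roll_back(seqs, h, w):
    """Undo the four unrollings and average them into one (H, W, C) image."""
    c = seqs[0].shape[1]
    outs = [
        seqs[0].reshape(h, w, c),
        seqs[1][::-1].reshape(h, w, c),
        seqs[2].reshape(w, h, c).transpose(1, 0, 2),
        seqs[3][::-1].reshape(w, h, c).transpose(1, 0, 2),
    ]
    return np.mean(outs, axis=0)

# Usage: scan a small random image in all four directions and merge.
rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8, 3))
a = np.full(4, 0.9)   # hypothetical decay rates
b = np.ones(4)
c = np.full(4, 0.25)
scanned = [diagonal_ssm_scan(s, a, b, c) for s in unroll_four_directions(image)]
output = roll_back(scanned, 8, 8)
```

Because every pixel's output depends on all pixels that precede it in some scan direction, the four scans together give each location a global receptive field, while the total work stays proportional to the number of pixels. Self-attention over the same sequence would instead compare all pairs of positions, which is where the quadratic cost comes from.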
