Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models

Mohammad Shahab Sepehri; Zalan Fabian; Mahdi Soltanolkotabi

TL;DR

Serpent tackles high-resolution image restoration by combining structured state space models (SSMs) with multi-scale, patch-based processing in a U-Net-like architecture. Using selective SSMs with four-direction unrolling, it models long-range dependencies at a cost that scales roughly linearly with input size, so large, high-resolution images can be restored with a fraction of the compute and memory of attention-based methods. The approach matches or surpasses state-of-the-art methods such as Restormer and SwinIR on Gaussian deblurring and 8× super-resolution, while reducing FLOPS by up to 150× and GPU memory by up to 5×, with the largest gains at the higher 512× resolution. These efficiency gains make high-resolution restoration more practical on standard GPUs and point to SSMs as a promising basis for scalable dense-vision models, with caveats around software/hardware support and the risk of training-data bias.

Abstract

The landscape of computational building blocks of efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters, while efficient, are inherently local and therefore struggle with modeling long-range dependencies in images. In contrast, attention excels at capturing global interactions between arbitrary image regions, but suffers from a quadratic cost in image dimension. In this work, we propose Serpent, an efficient architecture for high-resolution image restoration that combines recent advances in state space models (SSMs) with multi-scale signal processing in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with a favorable linear scaling in input size. We propose a novel hierarchical architecture, inspired by traditional signal processing principles, that converts the input image into a collection of sequences and processes them in a multi-scale fashion. Our experimental results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques, while requiring orders of magnitude less compute (up to a $150$-fold reduction in FLOPS) and up to $5\times$ less GPU memory, all while maintaining a compact model size. The efficiency gains achieved by Serpent are especially notable at high image resolutions.
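To make the abstract's claim about a global receptive field at linear cost concrete, here is a minimal sketch of a discrete linear state space recurrence. It is a toy with a scalar state and fixed parameters `a`, `b`, `c` (all names are illustrative assumptions), not the selective SSM used in Serpent, where the parameters depend on the input.

```python
def ssm_scan(x, a, b, c):
    """Recurrent form: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t."""
    h = 0.0
    ys = []
    for x_t in x:  # one constant-cost update per element, so O(len(x)) overall
        h = a * h + b * x_t
        ys.append(c * h)
    return ys


def ssm_kernel(x, a, b, c):
    """Equivalent explicit form: y_t = sum_k c * a**k * b * x_{t-k}.

    Written naively this costs O(n^2); it is here only to show that every
    output depends on all earlier inputs (a global, causal receptive field).
    """
    return [
        sum(c * (a ** k) * b * x[t - k] for k in range(t + 1))
        for t in range(len(x))
    ]


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0]
    # The impulse at position 0 still influences the last output.
    print(ssm_scan(impulse, a=0.5, b=1.0, c=1.0))
```

The scan touches each element once, whereas pairwise attention over the same sequence would compare every element with every other one. A single scan only sees earlier elements, which is one motivation for scanning an image along several directions.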

Paper Structure (15 sections, 3 equations, 7 figures, 2 tables)

Introduction
Background
State space models
Selective SSMs
SSMs in vision
Method
Serpent architecture
Serpent block
Merging and expanding patches
Experiments
Setup
Performance results
Efficiency results
Conclusion
Further training details

Figures (7)

Figure 1: Efficiency of Serpent: Serpent out-scales state-of-the-art reconstruction techniques in terms of FLOPS, especially at high image resolutions (left). Our technique matches the performance of state-of-the-art methods while utilizing $5\times$ less GPU memory during training, matching the scaling of fully convolutional architectures (right). We plot results with batch size $1$ on a single H100 GPU with $80$ GB memory. Missing data points indicate that the given model is out of memory with the specific input resolution.
Figure 2: The VSS block [yu_vmamba_2024] scans the image along four different unrolled directions using state space models (SS2D). Linear layers are used to create feature embeddings, for gating, and to produce the final output.
Figure 3: Overview of Serpent. Serpent has a U-Net architecture and uses S-blocks at each layer. An S-block consists of $n$ VSS blocks in sequence.
Figure 4: Visual comparison of reconstructions on the FFHQ $512\times$ deblurring task.
Figure 5: Visual comparison of reconstructions on the FFHQ $8\times$ super-resolution task.
...and 2 more figures
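
The four-direction scan described in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes the four directions are row-major order, column-major order, and the reverse of each, and that the four processed sequences are merged by summation. The function names are hypothetical and the per-sequence SSM is left out.

```python
def unroll_four_directions(img):
    """Flatten a 2D grid (list of equal-length rows) into four 1D sequences."""
    height, width = len(img), len(img[0])
    rows = [img[i][j] for i in range(height) for j in range(width)]  # row-major
    cols = [img[i][j] for j in range(width) for i in range(height)]  # column-major
    return [rows, cols, rows[::-1], cols[::-1]]


def merge_four_directions(seqs, height, width):
    """Map the four (processed) sequences back onto the grid and sum them."""
    rows, cols, rows_rev, cols_rev = seqs
    rows_back = rows_rev[::-1]  # undo the reversal
    cols_back = cols_rev[::-1]
    out = [[0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            r = i * width + j   # index of pixel (i, j) in row-major order
            c = j * height + i  # index of pixel (i, j) in column-major order
            out[i][j] = rows[r] + rows_back[r] + cols[c] + cols_back[c]
    return out


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6]]
    sequences = unroll_four_directions(image)
    # In a real block, each sequence would pass through an SSM here.
    print(merge_four_directions(sequences, height=2, width=3))
```

With the sequences left unprocessed, merging returns each pixel multiplied by four, which is a quick check that the index bookkeeping sends every value back to its original location.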
