Serpent: Scalable and Efficient Image Restoration via Multi-scale Structured State Space Models

Mohammad Shahab Sepehri, Zalan Fabian, Mahdi Soltanolkotabi

TL;DR

Serpent tackles high-resolution image restoration by combining structured state space models (SSMs) with multi-scale, patch-based processing in a U-Net-like architecture. By using selective SSMs with four-direction unrolling, it models long-range dependencies with near-linear scaling in input size, enabling high-resolution restoration at a fraction of the compute and memory of attention-based methods. The approach achieves reconstruction quality on par with state-of-the-art methods such as Restormer and SwinIR on Gaussian deblurring and 8× super-resolution, while requiring up to 150× fewer FLOPS and up to 5× less GPU memory, with the largest gains at high resolutions such as 512×512. These efficiency gains make high-resolution restoration practical on standard GPUs and suggest a promising direction for scalable dense-vision models built on SSMs, subject to caveats around software and hardware support and the risk of bias in training data.

Abstract

The landscape of computational building blocks of efficient image restoration architectures is dominated by a combination of convolutional processing and various attention mechanisms. However, convolutional filters, while efficient, are inherently local and therefore struggle with modeling long-range dependencies in images. In contrast, attention excels at capturing global interactions between arbitrary image regions, but suffers from a quadratic cost in image dimension. In this work, we propose Serpent, an efficient architecture for high-resolution image restoration that combines recent advances in state space models (SSMs) with multi-scale signal processing in its core computational block. SSMs, originally introduced for sequence modeling, can maintain a global receptive field with a favorable linear scaling in input size. We propose a novel hierarchical architecture, inspired by traditional signal processing principles, that converts the input image into a collection of sequences and processes them in a multi-scale fashion. Our experimental results demonstrate that Serpent can achieve reconstruction quality on par with state-of-the-art techniques, while requiring orders of magnitude less compute (up to a $150$-fold reduction in FLOPS) and up to $5\times$ less GPU memory, while maintaining a compact model size. The efficiency gains achieved by Serpent are especially notable at high image resolutions.
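
To make the linear-scaling claim concrete, the following is a minimal sketch of a discrete linear state space recurrence in NumPy. It is not Serpent's selective SSM, whose parameters depend on the input; the function name `ssm_scan`, the diagonal state matrix, the fixed parameters, and the scalar input channel are all simplifying assumptions made for illustration.

```python
import numpy as np

def ssm_scan(x, a, b, c):
    """Run a diagonal linear SSM over a 1-D sequence in a single pass.

    h_t = a * h_{t-1} + b * x_t   (elementwise, state of size d)
    y_t = c . h_t
    Hypothetical helper: fixed (non-selective) parameters, scalar input.
    """
    x = np.asarray(x, dtype=float)
    h = np.zeros_like(a, dtype=float)   # hidden state, shape (d,)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):         # one state update per token: O(L * d)
        h = a * h + b * x_t
        y[t] = c @ h
    return y

# An impulse at position 0 still influences the last output, so the
# receptive field spans the whole sequence while cost grows linearly.
a = np.array([0.9, 0.5])
b = np.array([1.0, 1.0])
c = np.array([1.0, 1.0])
impulse = np.zeros(8)
impulse[0] = 1.0
response = ssm_scan(impulse, a, b, c)
```

The loop performs one state update per token, so doubling the sequence length doubles the work, whereas pairwise attention over the same tokens would quadruple it.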

Paper Structure (15 sections, 3 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Efficiency of Serpent: Serpent out-scales state-of-the-art reconstruction techniques in terms of FLOPS, especially at high image resolutions (left). Our technique matches the performance of state-of-the-art methods while utilizing $5\times$ less GPU memory during training, matching the scaling of fully convolutional architectures (right). We plot results with batch size $1$ on a single H100 GPU with $80$ GB memory. Missing data points indicate that the given model is out of memory with the specific input resolution.
  • Figure 2: The VSS block [yu_vmamba_2024] scans the image along four different unrolled directions using state space models (SS2D). Linear layers are used to create feature embeddings, for gating, and to produce the final output.
  • Figure 3: Overview of Serpent. Serpent has a U-Net architecture and uses S-blocks at each layer. An S-block consists of $n$ VSS blocks in sequence.
  • Figure 4: Visual comparison of reconstructions on the FFHQ $512\times512$ deblurring task.
  • Figure 5: Visual comparison of reconstructions on the FFHQ $8\times$ super-resolution task.
  • ...and 2 more figures
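
The four-direction scan described in the Figure 2 caption can be sketched as follows. The caption only states that the image is unrolled along four directions; the specific traversal orders used here (row-major, column-major, and their reverses), the summation merge, and the function names `unroll_four_directions` and `merge_four_directions` are assumptions for illustration. In a full block each sequence would pass through an SSM before merging; here the sequences are passed through unchanged to show only the unroll and merge steps.

```python
import numpy as np

def unroll_four_directions(img):
    """Flatten an (H, W, C) feature map into four (H*W, C) sequences."""
    h, w, c = img.shape
    row_major = img.reshape(h * w, c)                      # scan rows left to right
    col_major = img.transpose(1, 0, 2).reshape(h * w, c)   # scan columns top to bottom
    # The remaining two directions are the reverses of the first two.
    return [row_major, col_major, row_major[::-1], col_major[::-1]]

def merge_four_directions(seqs, h, w):
    """Undo each unrolling and sum the four results back into (H, W, C)."""
    row, col, row_rev, col_rev = seqs
    c = row.shape[-1]
    out = row.reshape(h, w, c) + row_rev[::-1].reshape(h, w, c)
    out = out + col.reshape(w, h, c).transpose(1, 0, 2)
    out = out + col_rev[::-1].reshape(w, h, c).transpose(1, 0, 2)
    return out

# Tiny 2x3 single-channel example with pixel values 0..5.
img = np.arange(2 * 3 * 1, dtype=float).reshape(2, 3, 1)
seqs = unroll_four_directions(img)
restored = merge_four_directions(seqs, 2, 3)
```

Because every pixel appears exactly once in each of the four sequences, the total sequence length is four times the pixel count, which keeps the cost linear in image size.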