SwinIA: Self-Supervised Blind-Spot Image Denoising without Convolutions

Mikhail Papkov, Pavel Chizhov, Leopold Parts

TL;DR

This work tackles blind-spot self-supervised image denoising without requiring ground-truth clean data or noise-model knowledge. It introduces SwinIA, a fully transformer-based image autoencoder that uses diagonal window attention and multiscale patch embeddings to maintain pixel self-unawareness while enabling long-range interactions. Training uses a plain autoencoder objective $\mathcal{L}(f|\theta)=\mathbb{E}_x\lVert f(x|\theta) - x\rVert^2$ with a single forward pass and no masking, achieving strong performance across grayscale, color, mixed synthetic, and real-world noise datasets. The results establish convolution-free transformers as viable baselines for self-supervised denoising and suggest potential extensions to self-supervised feature learning.
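
Why a plain autoencoder objective can denoise at all is worth spelling out. The sketch below is the standard argument from the blind-spot denoising literature, not a derivation quoted from this paper. The symbols $s$ (clean signal) and $n$ (noise) are introduced here for illustration, and the assumptions are that $x = s + n$, that the noise is zero-mean given the signal, and that it is independent across pixels given the signal.

$$
\begin{aligned}
\mathbb{E}\,\lVert f(x|\theta) - x\rVert^2
  &= \mathbb{E}\,\lVert f(x|\theta) - s\rVert^2
   - 2\,\mathbb{E}\,\langle f(x|\theta) - s,\; n\rangle
   + \mathbb{E}\,\lVert n\rVert^2 \\
\mathbb{E}\,\langle f(x|\theta) - s,\; n\rangle
  &= \sum_i \mathbb{E}\big[\,\mathbb{E}[\,f_i(x|\theta) - s_i \mid s\,]\;\mathbb{E}[\,n_i \mid s\,]\,\big] = 0 \\
\Rightarrow\quad
\mathbb{E}\,\lVert f(x|\theta) - x\rVert^2
  &= \mathbb{E}\,\lVert f(x|\theta) - s\rVert^2 + \mathbb{E}\,\lVert n\rVert^2
\end{aligned}
$$

The second line uses the blind-spot property: if the output $f_i$ at pixel $i$ does not depend on $x_i$, then given $s$ it is independent of $n_i$, so the cross term factorizes and vanishes. Under these assumptions the noisy-target loss differs from the clean-target loss only by a constant that does not depend on $\theta$, so both share the same minimizer.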

Abstract

Self-supervised image denoising implies restoring the signal from a noisy image without access to the ground truth. State-of-the-art solutions for this task rely on predicting masked pixels with a fully-convolutional neural network. This most often requires multiple forward passes, information about the noise model, or intricate regularization functions. In this paper, we propose a Swin Transformer-based Image Autoencoder (SwinIA), the first fully-transformer architecture for self-supervised denoising. The flexibility of the attention mechanism helps to fulfill the blind-spot property that convolutional counterparts normally approximate. SwinIA can be trained end-to-end with a simple mean squared error loss without masking and does not require any prior knowledge about clean data or noise distribution. Simple to use, SwinIA establishes the state of the art on several common benchmarks.
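
To make the blind-spot property concrete, here is a minimal NumPy sketch of attention with a diagonally masked attention matrix. It is an illustration under stated assumptions, not the authors' implementation: the function name `diagonal_masked_attention`, the single-head setup, and the random stand-ins `pos` and `pix` for positional and pixel embeddings are all hypothetical. Queries are built from position only and keys/values from position plus content, which mirrors the description in the Figure 2 caption.

```python
import numpy as np

def diagonal_masked_attention(q, k, v):
    """Single-head attention where token i cannot attend to token i.

    q, k, v: arrays of shape (n_tokens, dim).
    """
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)                # (n, n) similarity logits
    scores[np.diag_indices(n)] = -np.inf         # mask the diagonal
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)                     # masked entries become 0
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
n, d = 16, 8
pos = rng.normal(size=(n, d))  # hypothetical positional embeddings
pix = rng.normal(size=(n, d))  # hypothetical pixel (patch) embeddings

q = pos            # queries carry position only
k = v = pos + pix  # keys and values carry position plus content
out = diagonal_masked_attention(q, k, v)

# Change the content of token 3 only and recompute.
pix2 = pix.copy()
pix2[3] += 10.0
out2 = diagonal_masked_attention(q, pos + pix2, pos + pix2)
```

In this sketch the output at token 3 is unchanged when only token 3's content changes, while other tokens' outputs do change. That is the self-unawareness a blind-spot network needs, obtained here in a single forward pass without masking the input.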

Paper Structure

This paper contains 21 sections, 5 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: Self-unaware autoencoding in text and images.
  • Figure 2: SwinIA model. Multiscale positional embeddings act as queries for the encoder (three parallel blocks) and are added to patch embeddings to create constant keys and values. The decoder (two remaining blocks) fuses the extracted features into the final output image. Encoder and decoder blocks have an identical structure and consist of four transformer blocks with cyclically shifted window attention.
  • Figure 3: Pixel shuffle example: an image of size $8\times 8$ with window size $4$ is shuffled into patches of sizes $1\times1$ and $2\times2$.
  • Figure 4: SwinIA transformer block architecture. MSA and MLP are preceded by layer normalization and complemented with a shortcut by addition. Only queries are normalized before the MSA because keys and values are normalized upon creation. The attention is performed between shuffled patches. The attention matrix is diagonally masked to maintain pixel self-unawareness.
  • Figure 5: Kodak [franzen1999kodak] (top), ImageNet [n2same] (middle), and FMD Two-Photon Mice [zhang2018poisson] (bottom) denoising examples. Every predicted image is cropped to a square for visualization and presented with the corresponding PSNR score (in dB).
  • ...and 7 more figures
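
The Figure 3 caption describes shuffling an $8\times 8$ image with window size $4$ into patches of sizes $1\times1$ and $2\times2$. One plausible reading, sketched below, is a strided (space-to-depth style) rearrangement in which every $p$-th pixel is gathered into its own sub-image. Whether this matches the authors' exact shuffle is an assumption, and the helper name `strided_shuffle` is hypothetical.

```python
import numpy as np

def strided_shuffle(img, p):
    """Split an (H, W) image into p*p sub-images by taking every p-th pixel.

    Returns an array of shape (p*p, H//p, W//p); sub-image a*p + b
    holds img[a::p, b::p].
    """
    h, w = img.shape
    assert h % p == 0 and w % p == 0, "image size must be divisible by p"
    x = img.reshape(h // p, p, w // p, p)  # (row_block, a, col_block, b)
    x = x.transpose(1, 3, 0, 2)            # (a, b, row_block, col_block)
    return x.reshape(p * p, h // p, w // p)

img = np.arange(64).reshape(8, 8)
fine = strided_shuffle(img, 1)    # shape (1, 8, 8): the image itself
coarse = strided_shuffle(img, 2)  # shape (4, 4, 4): four 4x4 sub-images
```

Under this reading, at stride 2 each $4\times4$ sub-image fits in a single window of size $4$ while its pixels span the whole $8\times 8$ image, which would let windowed attention reach across longer distances at the coarser scale.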