SwinIA: Self-Supervised Blind-Spot Image Denoising without Convolutions
Mikhail Papkov, Pavel Chizhov, Leopold Parts
TL;DR
This work tackles blind-spot self-supervised image denoising without requiring ground-truth clean data or knowledge of the noise model. It introduces SwinIA, a fully transformer-based image autoencoder that uses diagonal window attention and multiscale patch embeddings to keep each output pixel unaware of its own input value (the blind-spot property) while still enabling long-range interactions. Training uses a plain autoencoder objective $\mathcal{L}(f|\theta)=\mathbb{E}_x\lVert f(x|\theta) - x\rVert^2$ with a single forward pass and no masking, and achieves strong performance across grayscale, color, mixed synthetic, and real-world noise datasets. The results establish convolution-free transformers as viable baselines for self-supervised denoising and suggest potential extensions to self-supervised feature learning.
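A minimal sketch of the objective above, assuming NumPy and a toy batch of noisy images. It estimates $\mathbb{E}_x\lVert f(x) - x\rVert^2$ for two stand-in predictors: the identity map, which trivially reaches zero loss and is the solution a blind-spot design must rule out, and a hypothetical `neighbor_mean` function that predicts each pixel from its four neighbours only. Neither function is the SwinIA model; `autoencoder_loss`, `neighbor_mean`, and the synthetic data are illustrative assumptions.

```python
import numpy as np

def autoencoder_loss(f, batch):
    """Batch estimate of E_x ||f(x) - x||^2, where the target is the noisy input itself."""
    preds = np.stack([f(x) for x in batch])
    # Squared L2 norm per image, then averaged over the batch.
    return float(np.mean(np.sum((preds - batch) ** 2, axis=(1, 2))))

def neighbor_mean(x):
    """Hypothetical blind-spot stand-in: mean of the 4 neighbours, never the pixel itself."""
    p = np.pad(x, 1, mode="reflect")  # reflect padding does not repeat the border pixel
    return (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4.0

rng = np.random.default_rng(0)
clean = np.full((8, 16, 16), 0.5)                       # assumed toy "clean" signal
noisy = clean + 0.1 * rng.standard_normal(clean.shape)  # assumed i.i.d. Gaussian noise

identity_loss = autoencoder_loss(lambda x: x, noisy)    # copying the input: zero loss
blind_loss = autoencoder_loss(neighbor_mean, noisy)     # cannot copy, so loss stays positive
```

Under these assumptions the identity map minimizes the loss without denoising anything, whereas a predictor that never sees the target pixel can only lower the loss by estimating the underlying signal from context.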
Abstract
Self-supervised image denoising means restoring the signal from a noisy image without access to the ground truth. State-of-the-art solutions for this task rely on predicting masked pixels with a fully convolutional neural network. This most often requires multiple forward passes, information about the noise model, or intricate regularization functions. In this paper, we propose a Swin Transformer-based Image Autoencoder (SwinIA), the first fully-transformer architecture for self-supervised denoising. The flexibility of the attention mechanism helps to fulfill the blind-spot property that convolutional counterparts normally only approximate. SwinIA can be trained end-to-end with a simple mean squared error loss without masking and does not require any prior knowledge about clean data or noise distribution. Simple to use, SwinIA establishes the state of the art on several common benchmarks.
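The blind-spot property mentioned above can be illustrated with a simplified attention step. The sketch below is not the SwinIA architecture: it assumes a single attention head over one flattened window, with queries and keys computed from positional embeddings only and values taken from the pixels, and it masks the diagonal of the attention matrix so that no token attends to itself. The function name `diagonal_masked_attention` and all shapes are hypothetical.

```python
import numpy as np

def diagonal_masked_attention(pos, pixels):
    """pos: (n, d) positional embeddings; pixels: (n, c) pixel values of one window."""
    # Assumption: queries and keys depend on position only, never on pixel content.
    scores = pos @ pos.T / np.sqrt(pos.shape[1])
    np.fill_diagonal(scores, -np.inf)            # token i gets zero weight on itself
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ pixels                      # output i mixes all pixels except pixel i

rng = np.random.default_rng(0)
pos = rng.standard_normal((16, 8))     # 16 tokens, 8-dim positional embeddings (assumed)
pixels = rng.standard_normal((16, 3))  # 16 tokens, 3 channels (assumed)

out = diagonal_masked_attention(pos, pixels)

perturbed = pixels.copy()
perturbed[5] += 10.0                   # change the input value of token 5 only
out_perturbed = diagonal_masked_attention(pos, perturbed)
```

In this toy setting, `out[5]` is unchanged by the perturbation while other outputs shift, which is the exact (rather than approximate) blind-spot behaviour the abstract attributes to attention.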
