Exploring Real&Synthetic Dataset and Linear Attention in Image Restoration

Yuzhen Du, Teng Hu, Jiangning Zhang, Ran Yi, Chengming Xu, Xiaobin Hu, Kai Wu, Donghao Luo, Yabiao Wang, Lizhuang Ma

TL;DR

The paper addresses two problems in image restoration (IR): a mismatch in image complexity between common training and testing datasets, and the lack of a shared training protocol. It proposes ReSyn, a large-scale Real&Synthetic IR benchmark whose images are filtered with a GLCM-based complexity metric. It then introduces RWKV-IR, a linear-attention IR model that combines global and local modeling through DC-Shift and Cross-Bi-WKV within a three-stage restoration pipeline, and establishes a unified training standard so that models can be compared fairly. Experiments on super-resolution (SR), denoising, and JPEG tasks show RWKV-IR achieving strong results, and the ReSyn benchmark supports cross-dataset evaluation. The practical contribution is fairer model comparison and an efficient IR model built on linear attention.
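The summary names a GLCM-based complexity metric but does not give its formula. As a rough illustration only, the sketch below builds a gray-level co-occurrence matrix (GLCM) for a single pixel offset and uses its entropy as a stand-in complexity score. The function names `glcm` and `glcm_entropy`, the horizontal offset, and the choice of entropy are all assumptions made for this example, not the paper's definition.

```python
import math
from collections import Counter


def glcm(image, dy=0, dx=1):
    """Normalized gray-level co-occurrence matrix for one pixel offset.

    image: list of rows of integer gray levels.
    Returns a dict mapping (level_a, level_b) -> probability that a pixel
    with level_a has a pixel with level_b at offset (dy, dx).
    """
    counts = Counter()
    h, w = len(image), len(image[0])
    for y in range(h):
        for x in range(w):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w:
                counts[(image[y][x], image[ny][nx])] += 1
    total = sum(counts.values())
    if total == 0:  # image too small for this offset
        return {}
    return {pair: n / total for pair, n in counts.items()}


def glcm_entropy(image, dy=0, dx=1):
    """Shannon entropy (bits) of the GLCM; a hypothetical complexity proxy."""
    return sum(-p * math.log2(p) for p in glcm(image, dy, dx).values())


flat = [[0, 0, 0], [0, 0, 0]]           # one gray level: a single co-occurrence pair
checker = [[0, 1, 0, 1], [1, 0, 1, 0]]  # alternating levels: two equally likely pairs
print(glcm_entropy(flat), glcm_entropy(checker))
```

Under this proxy a flat patch scores lower than a textured one, which is the kind of ordering a complexity filter needs; the actual metric and thresholds used to build ReSyn may differ.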

Abstract

Image restoration (IR) aims to recover high-quality images from degraded inputs, with recent deep learning advancements significantly enhancing performance. However, existing methods lack a unified training benchmark for iterations and configurations. We also identify a bias in image complexity distributions between commonly used IR training and testing datasets, resulting in suboptimal restoration outcomes. To address this, we introduce a large-scale IR dataset called ReSyn, which employs a novel image filtering method based on image complexity to ensure a balanced distribution and includes both real and AIGC synthetic images. We establish a unified training standard that specifies iterations and configurations for image restoration models, focusing on measuring model convergence and restoration capability. Additionally, we enhance transformer-based image restoration models using linear attention mechanisms by proposing RWKV-IR, which integrates linear complexity RWKV into the transformer structure, allowing for both global and local receptive fields. Instead of directly using Vision-RWKV, we replace the original Q-Shift in RWKV with a Depth-wise Convolution shift to better model local dependencies, combined with Bi-directional attention for comprehensive linear attention. We also introduce a Cross-Bi-WKV module that merges two Bi-WKV modules with different scanning orders for balanced horizontal and vertical attention. Extensive experiments validate the effectiveness of our RWKV-IR model.
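The abstract says Cross-Bi-WKV merges two Bi-WKV modules with different scanning orders, without spelling out the orders or the merge. A minimal sketch, assuming the two orders are row-major and column-major flattening of the feature grid, is given below; the helper names are hypothetical and the attention computation itself is omitted.

```python
def row_major_scan(grid):
    """Flatten an H x W grid row by row (horizontal neighbors stay adjacent)."""
    return [value for row in grid for value in row]


def column_major_scan(grid):
    """Flatten an H x W grid column by column (vertical neighbors stay adjacent)."""
    h, w = len(grid), len(grid[0])
    return [grid[y][x] for x in range(w) for y in range(h)]


def unscan_column_major(tokens, h, w):
    """Inverse of column_major_scan: rebuild the H x W grid."""
    return [[tokens[x * h + y] for x in range(w)] for y in range(h)]


grid = [[1, 2, 3],
        [4, 5, 6]]
print(row_major_scan(grid))     # [1, 2, 3, 4, 5, 6]
print(column_major_scan(grid))  # [1, 4, 2, 5, 3, 6]
```

In the row-major sequence, vertically adjacent pixels sit W tokens apart, while in the column-major sequence they are neighbors. Running a sequence model over both orders is one plausible way to balance horizontal and vertical context, which matches the motivation stated in the abstract.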

Paper Structure

This paper contains 21 sections, 2 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: The diversity analysis of our ReSyn dataset. It contains both real and synthetic images from a variety of data sources and covers a wide range of resolutions.
  • Figure 2: The complexity distributions of different datasets. The training datasets DIV2K [timofte2017ntire] and DF2K [lim2017enhanced] are skewed toward low-complexity images. Our ReSyn dataset balances low- and high-complexity images by filtering with the newly proposed GLCM image complexity measure.
  • Figure 3: PSNR for $\times$2 SR on Urban100 [huang2015single] can be predicted from the proposed GLCM image complexity and from bits per pixel (BPP) [timofte2017ntire]. The analysis covers images restored by MambaIR, SwinIR, and bicubic upsampling.
  • Figure 4: Framework of our RWKV-IR, which consists of three stages: shallow feature extraction, deep feature enhancing, and HQ image reconstruction. For deep feature enhancing, a series of Global$\&$Local Linear attention Layers (GLLL, which is based on RWKV) and a Conv Block are used. Each GLLL contains several GLLB blocks.
  • Figure 5: Different shift methods. The Q-shift is a simple channel replacement operation using four neighboring pixels, while our DC-shift is a depth-wise conv leveraging the surrounding pixels in a $k \times k$ neighborhood.
  • ...and 2 more figures
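The Figure 5 caption contrasts Q-shift (a channel replacement using four neighboring pixels) with DC-shift (a depth-wise convolution over a $k \times k$ neighborhood). The sketch below illustrates that contrast on nested-list feature maps of shape [C][H][W]. It assumes channels are split evenly into four groups, each reading from one neighbor with zero padding, and that the depth-wise kernels are given rather than learned; `q_shift`, `dc_shift`, and `one_hot_kernel` are hypothetical names, not the authors' code.

```python
# (dy, dx) of the neighbor each channel group reads from: left, right, up, down.
NEIGHBOR_OFFSETS = [(0, -1), (0, 1), (-1, 0), (1, 0)]


def q_shift(feat):
    """Each quarter of the channels is replaced by one neighboring pixel (zero padded)."""
    channels = len(feat)
    h, w = len(feat[0]), len(feat[0][0])
    out = []
    for c, plane in enumerate(feat):
        dy, dx = NEIGHBOR_OFFSETS[c * 4 // channels]
        out.append([
            [plane[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0.0
             for x in range(w)]
            for y in range(h)
        ])
    return out


def dc_shift(feat, kernels):
    """Depth-wise convolution: channel c is filtered only by kernels[c] (odd k, zero padded)."""
    out = []
    for plane, kern in zip(feat, kernels):
        k = len(kern)
        r = k // 2
        h, w = len(plane), len(plane[0])
        out_plane = [[0.0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                acc = 0.0
                for i in range(k):
                    for j in range(k):
                        yy, xx = y + i - r, x + j - r
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += kern[i][j] * plane[yy][xx]
                out_plane[y][x] = acc
        out.append(out_plane)
    return out


def one_hot_kernel(dy, dx, k=3):
    """k x k kernel that copies the pixel at offset (dy, dx)."""
    r = k // 2
    kern = [[0.0] * k for _ in range(k)]
    kern[r + dy][r + dx] = 1.0
    return kern


plane = [[1, 2], [3, 4]]
feat = [plane, plane, plane, plane]
kernels = [one_hot_kernel(dy, dx) for dy, dx in NEIGHBOR_OFFSETS]
print(dc_shift(feat, kernels) == q_shift(feat))  # True
```

With one-hot kernels the depth-wise convolution reproduces the four-neighbor shift exactly, so under these assumptions DC-shift can be read as a generalization in which every position of the $k \times k$ window carries a weight.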