
Multi-Scale Implicit Transformer with Re-parameterize for Arbitrary-Scale Super-Resolution

Jinchen Zhu, Mingjian Zhang, Ling Zheng, Shizhuang Weng

TL;DR

This work designs the Multi-Scale Implicit Transformer (MSIT), which consists of a Multi-Scale Neural Operator (MSNO) and Multi-Scale Self-Attention (MSSA), and proposes the Re-Interaction Module (RIM), combined with a cumulative training strategy, to improve the diversity of the information the network learns.

Abstract

Recently, methods based on implicit neural representations have shown excellent capabilities for arbitrary-scale super-resolution (ASSR). Although these methods represent image features by generating latent codes, the latent codes adapt poorly to different super-resolution magnification factors, which seriously affects performance. To address this, we design the Multi-Scale Implicit Transformer (MSIT), consisting of a Multi-Scale Neural Operator (MSNO) and Multi-Scale Self-Attention (MSSA). MSNO obtains multi-scale latent codes through feature enhancement, multi-scale characteristic extraction, and multi-scale characteristic merging. MSSA further enhances the multi-scale characteristics of the latent codes, resulting in better performance. Furthermore, we propose the Re-Interaction Module (RIM), combined with a cumulative training strategy, to improve the diversity of the information the network learns. We systematically introduce multi-scale characteristics into ASSR for the first time. Extensive experiments validate the effectiveness of MSIT, and our method achieves state-of-the-art performance on arbitrary-scale super-resolution tasks.
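
As a rough illustration of the implicit-representation idea the abstract builds on, the sketch below decodes an image at a non-integer scale by pairing each output coordinate with its nearest latent code and passing the code, the relative offset, and the output cell size through a small MLP. It is a generic sketch, not the MSIT architecture: the function names (`make_coords`, `query_rgb`), the nearest-code lookup, the two-layer MLP, and the random weights are all assumptions made for illustration.

```python
import numpy as np

def make_coords(h, w):
    """Pixel-centre coordinates in [-1, 1] for an h x w grid, shape (h, w, 2)."""
    ys = (np.arange(h) + 0.5) / h * 2 - 1
    xs = (np.arange(w) + 0.5) / w * 2 - 1
    return np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1)

def query_rgb(latent, scale, w1, b1, w2, b2):
    """Decode a (round(H*scale), round(W*scale), 3) image from (H, W, C) latent codes."""
    h, w, _ = latent.shape
    out_h, out_w = int(round(h * scale)), int(round(w * scale))
    q = make_coords(out_h, out_w)                       # query coordinates
    # Index of the nearest latent code for every query coordinate.
    iy = np.clip(np.floor((q[..., 0] + 1) / 2 * h).astype(int), 0, h - 1)
    ix = np.clip(np.floor((q[..., 1] + 1) / 2 * w).astype(int), 0, w - 1)
    z = latent[iy, ix]                                  # (out_h, out_w, C)
    rel = q - make_coords(h, w)[iy, ix]                 # offset to that code's centre
    cell = np.broadcast_to(np.array([2.0 / out_h, 2.0 / out_w]), rel.shape)
    x = np.concatenate([z, rel, cell], axis=-1)         # (out_h, out_w, C + 4)
    hidden = np.maximum(x @ w1 + b1, 0.0)               # ReLU layer (assumed decoder)
    return hidden @ w2 + b2

# Toy usage with random latent codes and untrained (hypothetical) decoder weights.
rng = np.random.default_rng(0)
latent = rng.standard_normal((8, 8, 16))
w1, b1 = rng.standard_normal((20, 32)) * 0.1, np.zeros(32)
w2, b2 = rng.standard_normal((32, 3)) * 0.1, np.zeros(3)
sr = query_rgb(latent, 2.5, w1, b1, w2, b2)             # same codes, any scale
```

The same latent codes serve every scale here, and only the query grid changes. That single-scale reuse is the limitation the abstract points to when it says latent codes are difficult to adapt to different magnification factors.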

Paper Structure (23 sections, 11 equations, 9 figures, 5 tables)

This paper contains 23 sections, 11 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Mean error maps (mem) for SR results with different magnification factors, where a brighter color indicates a larger error. With a large magnification factor, the network has less error on texture than with a small magnification factor, indicating that SR at large magnification focuses more on the detailed texture of the target. In contrast, at small magnification each pixel of the image is only slightly enlarged, so the overall structure of the image (e.g., the shape and position of objects) is preserved, and the network focuses more on the overall shape of the target.
  • Figure 2: Overall architecture for continuous image SR
  • Figure 3: (a) Schematic overview of the MSNO structure: first, the input features are enriched by FEM. Next, MSC is used to obtain multi-scale latent codes. Lastly, SIM performs scale mixing within the same scale and across different scales, enhancing feature diversity. (b) MSSA first obtains $Q$ and $K$ by aggregating features of different scales through parallel convolution. $Q$ and $K$ are then interpolated for $\hat{\chi}^h$ and $\hat{\chi}^l$ to obtain $\hat{Q}$ and $\hat{K}$. Attention weights are computed from $\hat{Q}$, $\hat{K}$, and the relative coordinates. Finally, a product with $\hat{V}$ generates the attention latent codes $\mathcal{Z}^A$.
  • Figure 4: The process of re-parameterizing a training strategy, where the encoder and decoder are omitted.
  • Figure 5: Qualitative comparison of MSIT, using RDN as the encoder.
  • ...and 4 more figures
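
The last two steps in the Figure 3(b) caption (attention weights from $\hat{Q}$, $\hat{K}$, and relative coordinates, followed by a product with $\hat{V}$) can be sketched roughly as below. This is only a guess at the general shape of such a computation, not the paper's MSSA: the scaled dot product, the softmax, the linear relative-coordinate bias `w_rel`, the neighbourhood size `M`, and the function name `local_coord_attention` are all assumptions, and the parallel convolutions and interpolation that produce the inputs are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def local_coord_attention(q_hat, k_hat, v_hat, rel, w_rel):
    """
    q_hat: (N, D)     one query vector per high-resolution coordinate
    k_hat: (N, M, D)  keys sampled at M nearby low-resolution coordinates
    v_hat: (N, M, D)  values sampled at the same M coordinates
    rel:   (N, M, 2)  relative coordinates between each query and its neighbours
    w_rel: (2,)       hypothetical linear map from relative coordinates to a bias
    Returns (N, D) attention latent codes.
    """
    d = q_hat.shape[-1]
    logits = np.einsum("nd,nmd->nm", q_hat, k_hat) / np.sqrt(d)
    logits = logits + rel @ w_rel           # positional bias (assumed form)
    weights = softmax(logits, axis=-1)      # (N, M); each row sums to 1
    return np.einsum("nm,nmd->nd", weights, v_hat)

# Toy usage with random inputs: N = 5 queries, M = 4 neighbours, D = 8 channels.
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))
k = rng.standard_normal((5, 4, 8))
v = rng.standard_normal((5, 4, 8))
rel = rng.uniform(-1.0, 1.0, size=(5, 4, 2))
w_rel = rng.standard_normal(2)
z_a = local_coord_attention(q, k, v, rel, w_rel)
```

Because each row of weights sums to 1, every output code is a convex combination of the neighbouring values, so the relative coordinates change only how the neighbours are mixed.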