LMLT: Low-to-high Multi-Level Vision Transformer for Image Super-Resolution

Jeongsoo Kim; Jongho Nang; Junsuk Choe

TL;DR

The Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head, overcomes the window boundary issues in self-attention and significantly reduces inference time and GPU memory usage.

Abstract

Recent Vision Transformer (ViT)-based methods for Image Super-Resolution have demonstrated impressive performance. However, they suffer from high computational complexity, resulting in long inference times and high memory usage. Additionally, ViT models using Window Self-Attention (WSA) face challenges in processing regions outside their windows. To address these issues, we propose the Low-to-high Multi-Level Transformer (LMLT), which employs attention with varying feature sizes for each head. LMLT divides image features along the channel dimension, gradually reduces the spatial size for lower heads, and applies self-attention to each head. This approach effectively captures both local and global information. By integrating the results from lower heads into higher heads, LMLT overcomes the window boundary issues in self-attention. Extensive experiments show that our model significantly reduces inference time and GPU memory usage while maintaining or even surpassing the performance of state-of-the-art ViT-based Image Super-Resolution methods. Our code is available at https://github.com/jwgdmkj/LMLT.
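The mechanism described in the abstract (channel split, progressively smaller feature maps for lower heads, per-head window attention, and lower-to-higher integration) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names (`avg_pool`, `upsample`, `window_attention`, `low_to_high_attention`), the use of average pooling with a factor of 2 per level, nearest-neighbour upsampling, additive integration, and projection-free attention are all assumptions made for illustration.

```python
import numpy as np

def avg_pool(x, k):
    # x: (C, H, W). Average-pool with kernel and stride k (H and W divisible by k).
    c, h, w = x.shape
    return x.reshape(c, h // k, k, w // k, k).mean(axis=(2, 4))

def upsample(x, k):
    # Nearest-neighbour upsampling by factor k along both spatial axes.
    return x.repeat(k, axis=1).repeat(k, axis=2)

def window_attention(x, win):
    # Projection-free self-attention inside non-overlapping win x win windows.
    # (Assumption: a real model would use learned query/key/value projections.)
    c, h, w = x.shape
    out = np.empty_like(x)
    for i in range(0, h, win):
        for j in range(0, w, win):
            patch = x[:, i:i + win, j:j + win]
            ph, pw = patch.shape[1:]
            tokens = patch.reshape(c, ph * pw).T              # (N, C)
            scores = tokens @ tokens.T / np.sqrt(c)           # (N, N)
            scores = scores - scores.max(axis=1, keepdims=True)
            attn = np.exp(scores)
            attn = attn / attn.sum(axis=1, keepdims=True)     # row-wise softmax
            out[:, i:i + ph, j:j + pw] = (attn @ tokens).T.reshape(c, ph, pw)
    return out

def low_to_high_attention(x, num_heads=4, win=4):
    # Head 0 keeps full resolution; head i is pooled by 2**i (assumed factor).
    chunks = np.split(x, num_heads, axis=0)                   # split along channels
    outputs = [None] * num_heads
    carry = None
    for i in reversed(range(num_heads)):                      # lowest resolution first
        k = 2 ** i
        feat = avg_pool(chunks[i], k)
        if carry is not None:
            # Integrate the lower head's result into the next higher head.
            feat = feat + upsample(carry, 2)
        carry = window_attention(feat, win)
        outputs[i] = upsample(carry, k)                       # back to full resolution
    return np.concatenate(outputs, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 32, 32))
y = low_to_high_attention(x, num_heads=4, win=4)
print(y.shape)  # (8, 32, 32)
```

In this sketch the window size is the same at every level, so a window on the most heavily pooled head covers a much larger region of the original image. Because that head's output is added into each higher head before its own attention, full-resolution outputs can depend on pixels far outside their local window.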

Paper Structure (13 sections, 2 equations, 13 figures, 14 tables)

Introduction
Related Works
Proposed Method
Experiments
Comparisons with State-of-the-Art Methods
Ablation Study
Conclusion
Impact of number of Blocks, Channels, Heads and Depths
Effects of Low-to-high connection and Pooling
Impact of Activation function
LAM and ERF Comparisons
CCM: Convolutional Channel Mixer
Comparisons on LMLT with Other Methods

Figures (13)

Figure 1: Left: PSNR comparison of our proposed LMLT and other state-of-the-art models when upscaling Manga109 by a factor of 3. The size of each circle represents the number of parameters. Our model achieves comparable performance in terms of FLOPs when the channels are set to 36, 36 with 12 blocks, 60, and 84. Right: (a) The conventional Self-Attention block stacks multiple Self-Attention layers in series. (b) Our proposed Self-Attention block stacks the layers in parallel. Here, SAL stands for Self-Attention Layer.
Figure 2: The architecture overview of the proposed method.
Figure 3: Self-attention (SA) at different spatial resolutions of the image with the same window size.
Figure 4: Features from each head ((a) to (d)), aggregated feature (e), and feature multiplied with the original feature (f).
Figure 5: Visual comparisons for $\times 4$ SR on the Manga109 dataset. Compared with the results in (c) to (f), our models (LMLT-Base (g) and LMLT-Large (h)) restore much more accurate and clear images. More results are in the Appendix.
...and 8 more figures